Fingerprints datastore and stale fingerprint removal in de-duplication environments

ABSTRACT

A storage server is coupled to a storage device that stores blocks of data, and generates a fingerprint for each data block stored on the storage device. The storage server creates a fingerprints datastore that is divided into a primary datastore and a secondary datastore. The primary datastore comprises a single entry for each unique fingerprint and the secondary datastore comprises an entry having an identical fingerprint as an entry in the primary datastore. The storage server merges entries in a changelog with the entries in the primary datastore to identify duplicate data blocks in the storage device and frees the identified duplicate data blocks in the storage device. The storage server stores the entries that correspond to the freed data blocks to a third datastore and overwrites the primary datastore with the entries from the merged data that correspond to the unique fingerprints to create an updated primary datastore.

RELATED APPLICATIONS

This application is a Continuation of, and claims the priority benefit of, U.S. application Ser. No. 12/969,527 filed Dec. 15, 2010.

This application is related to U.S. patent application Ser. No. 12/969,531, entitled “SEGMENTED FINGERPRINT DATASTORE AND SCALING A FINGERPRINT DATASTORE IN DEDUPLICATION ENVIRONMENTS,” which is assigned to the same assignee as the present application.

FIELD OF THE INVENTION

At least one embodiment of the present invention pertains to data storage systems, and more particularly, to a technique for identifying and removing stale entries.

BACKGROUND

In a data storage system it is desirable to use storage space as efficiently as possible, to avoid wasting storage space. One type of system in which this concern can be particularly important is a storage server, such as a file server. File servers and other types of storage servers often are used to maintain extremely large quantities of data. In such systems, efficiency of storage space utilization is critical.

Data containers (e.g., files) maintained by a file system generally are made up of individual blocks of data. A common block size is four kilobytes. In a large file system, it is common to find duplicate occurrences of individual blocks of data. Duplication of data blocks may occur when, for example, two or more files have some data in common or where a given set of data occurs at multiple places within a given file. Duplication of data blocks results in inefficient use of storage space.

A de-duplication process eliminates redundant data within a file system. A deduplication process can occur in-line or offline. When a de-duplication process occurs while data is being written to a file system, the process can be referred to as ‘in-line deduplication.’ When a de-duplication process occurs after data is written to a storage device (e.g., disk), the process can be referred to as ‘offline de-duplication.’ A deduplication process can further be described, for example, to include two operations, such as a ‘de-duplication operation’ (identifying and eliminating duplicate data blocks) and a ‘verify operation’ (identifying and removing stale entries from a fingerprints datastore). The de-duplication process keeps a fingerprint value for every block within a file system in a fingerprints datastore (FPDS). This fingerprints datastore is used to find redundant blocks of data within the file system during a de-duplication operation. For example, typically, the fingerprints datastore is sorted on the basis of fingerprints to efficiently find potential duplicates. However, maintaining one entry for each block in a file system increases the size of the fingerprints datastore drastically. An increased fingerprints datastore size consumes more time during a de-duplication operation and a verify operation.
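
By way of illustration only, the following Python sketch shows how sorting entries by fingerprint brings potential duplicates together. The use of SHA-256 as the fingerprint, the function names, and the in-memory dictionary of blocks are assumptions made for this example and are not taken from the embodiments described herein.

```python
import hashlib
from itertools import groupby


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-256 stands in for whatever fingerprint function is used.
    return hashlib.sha256(block).hexdigest()


def find_potential_duplicates(blocks):
    """Group block identifiers whose fingerprints match.

    blocks: dict mapping a hypothetical block identifier to its bytes.
    Returns a list of identifier lists, one per fingerprint seen more than once.
    """
    # Sorting by fingerprint places entries with equal fingerprints side by side.
    entries = sorted((fingerprint(data), block_id) for block_id, data in blocks.items())
    groups = []
    for _, group in groupby(entries, key=lambda entry: entry[0]):
        ids = [block_id for _, block_id in group]
        if len(ids) > 1:
            groups.append(ids)
    return groups
```

Blocks that share a fingerprint are only potential duplicates; an implementation would typically confirm a match by comparing block contents before sharing the blocks.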

During de-duplication, some fingerprint entries in the fingerprints datastore become stale. A stale fingerprint entry is an entry that has a fingerprint that corresponds to a data block that has been deleted (freed) or overwritten, for example, during a de-duplication operation. The stale entries do not contribute to any space savings and add significant overhead in subsequent operations on the fingerprints datastore. These stale entries can be removed, for example, using a verify operation. Current implementations of a verify operation include two stages. In stage one, the fingerprints datastore is first sorted in order by <file identifier, block offset in a file, time stamp>, to check whether each fingerprint entry is stale or not. The fingerprints datastore is then overwritten with only the stale-free entries. In stage two, the output from stage one is sorted back to its original order (e.g., fingerprint, inode, file block number (fbn)). This conventional approach has several problems. It sorts the fingerprints datastore twice with each verify operation, and the second sort is not needed to remove the stale entries. Moreover, the conventional approach overwrites the entire FPDS with stale-free entries, even if the number of stale entries is a small percentage of the FPDS. In addition, a verify operation is typically a blocking operation, and thus, if a verify operation is executing on the FPDS, then no other deduplication (sharing) operation can execute because de-duplication operations and verify operations should work from a consistent copy of the FPDS.
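
For illustration, the two-stage conventional verify operation described above might be sketched in Python as follows. The `Entry` record, the `is_live` callback, and the rule that an older entry for the same <inode, fbn> is treated as overwritten are assumptions made for this example; the text above does not specify how staleness is tested.

```python
from typing import Callable, List, NamedTuple


class Entry(NamedTuple):
    fingerprint: str
    inode: int
    fbn: int
    timestamp: int


def conventional_verify(fpds: List[Entry],
                        is_live: Callable[[int, int], bool]) -> List[Entry]:
    # Stage one: sort by <file identifier, block offset, time stamp>.
    by_location = sorted(fpds, key=lambda e: (e.inode, e.fbn, e.timestamp))
    survivors = []
    for i, entry in enumerate(by_location):
        nxt = by_location[i + 1] if i + 1 < len(by_location) else None
        # Assumption: a newer entry for the same location means this one was overwritten.
        superseded = nxt is not None and (nxt.inode, nxt.fbn) == (entry.inode, entry.fbn)
        # Assumption: is_live reports whether the block still exists in the file system.
        if not superseded and is_live(entry.inode, entry.fbn):
            survivors.append(entry)
    # Stage two: sort the stale-free entries back to fingerprint order.
    return sorted(survivors, key=lambda e: (e.fingerprint, e.inode, e.fbn))
```

The sketch makes both costs visible: two full sorts, and a complete rewrite of the datastore regardless of how few entries are stale.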

De-duplication includes logging fingerprints of any new data block that is written or updated in the file system into a changelog file. The changelog file is merged with the fingerprints datastore to find duplicate blocks and to eliminate the duplicate data blocks. During this process, the fingerprints datastore is overwritten with the merged data with every de-duplication operation. Overwriting the entire fingerprints datastore with every de-duplication operation, however, can involve a large amount of write cost.

In addition, current de-duplication operations use logical information to identify blocks in a volume and their associated fingerprints. De-duplication maintains a fingerprint entry in the fingerprints datastore for each <inode, fbn>. That means, if a block is shared ‘n’ times, the fingerprints datastore will have ‘n’ entries for a single fingerprint value. In cases, however, where there is a significant amount of logical data, a fingerprints datastore cannot scale proportionately.

SUMMARY

One aspect of a de-duplication operation generates a fingerprint for each data block stored on a storage device in storage. The de-duplication operation divides a fingerprints datastore into a primary datastore and a secondary datastore. The primary datastore comprises a single entry for each unique fingerprint and the secondary datastore comprises an entry having an identical fingerprint as an entry in the primary datastore. The de-duplication operation merges entries in a changelog with the entries in the primary datastore to identify duplicate data blocks in the storage device and frees the identified duplicate data blocks in the storage device. The de-duplication operation stores the entries that correspond to the freed data blocks to a third datastore and overwrites the primary datastore with the entries from the merged data that correspond to the unique fingerprints to create an updated primary datastore.
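
The merge described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: entries are modeled as hypothetical (fingerprint, block identifier) tuples held in memory, and equal fingerprints are treated as duplicates without the content comparison an implementation would likely perform.

```python
def dedup_operation(primary, changelog):
    """Merge changelog entries with the primary datastore.

    primary:   list of (fingerprint, block_id), one entry per unique fingerprint.
    changelog: list of (fingerprint, block_id) for newly written blocks.
    Returns (updated_primary, third_datastore, freed_blocks).
    """
    # Python's sort is stable, so for equal fingerprints the existing
    # primary entry stays ahead of any changelog entry.
    merged = sorted(primary + changelog, key=lambda entry: entry[0])
    updated_primary, third_datastore, freed_blocks = [], [], []
    for fp, block_id in merged:
        if updated_primary and updated_primary[-1][0] == fp:
            # Same fingerprint as the retained entry: treat as a duplicate,
            # free the block, and record its entry in the third datastore.
            freed_blocks.append(block_id)
            third_datastore.append((fp, block_id))
        else:
            updated_primary.append((fp, block_id))
    return updated_primary, third_datastore, freed_blocks
```

Because only the unique-fingerprint entries are written back, the updated primary datastore in this sketch never holds more than one entry per fingerprint.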

During a verify operation, stale fingerprint entries are identified in the fingerprints datastore. A stale fingerprint entry is an entry that has a fingerprint that corresponds to a data block that has been deleted (freed) or overwritten, for example, during a de-duplication operation. One aspect of the verify operation identifies stale entries in the fingerprints datastore and writes stale entry information for the identified stale entries to a stale entries datastore. A subsequent de-duplication operation removes the stale entries in the fingerprints datastore using the stale entry information. Another aspect of a verify operation manages a verify operation as a background operation so that if any de-duplication request is made while a verify operation is executing, the de-duplication request can be served, which in turn helps decrease customer response time.

The present invention is described in conjunction with systems, clients, servers, methods, and computer-readable media of varying scope. In addition to the aspects of the present invention described in this summary, further aspects of the invention will become apparent by reference to the drawings and by reading the detailed description that follows.

BRIEF DESCRIPTION OF THE DRAWINGS

One or more embodiments of the present invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:

FIG. 1A illustrates a network storage system in which embodiments may be implemented.

FIG. 1B illustrates a distributed or clustered architecture for a network storage system in which embodiments may be implemented in an alternative embodiment.

FIG. 2 is a block diagram of an illustrative embodiment of a storage server in which embodiments may be implemented.

FIG. 3 illustrates an embodiment of the storage operating system of FIG. 2 in which embodiments may be implemented.

FIGS. 4A-4B show block diagrams of storage environments in which embodiments can be implemented.

FIG. 4C is a flow diagram showing a high-level de-duplication method, according to certain embodiments.

FIGS. 5A-5C are block diagrams of data sharing with respect to two files, according to certain embodiments.

FIG. 6 is a state diagram showing the states which a data block can have, according to certain embodiments.

FIG. 7 is a block diagram showing elements of a de-duplication module coupled to a primary fingerprints datastore (FPDS) and a secondary FPDS, according to certain embodiments.

FIG. 8A is a block diagram for dividing a FPDS into a primary fingerprints datastore and a secondary fingerprints datastore, according to certain embodiments.

FIG. 8B is a flow diagram of a method for creating a primary and a secondary datastore, according to certain embodiments.

FIG. 9A is a block diagram for de-duplication using a primary and a secondary datastore, according to certain embodiments.

FIG. 9B is a flow diagram of a method for de-duplication using a primary and a secondary datastore, according to certain embodiments.

FIG. 10 is a block diagram showing elements of a de-duplication module coupled to a segmented fingerprints datastore, according to certain embodiments.

FIGS. 11A-11B are diagrams for identifying and removing fingerprint entries corresponding to duplicate data blocks using a segmented fingerprints datastore, according to certain embodiments.

FIG. 12 is a block diagram showing elements of a de-duplication module for referring to a data block in a volume using a virtual volume block number (VVBN), according to certain embodiments.

FIG. 13 is a block diagram of mapping a data block from a file to a storage device (e.g., disk), according to certain embodiments.

FIGS. 14A-14B are diagrams for addressing a data block in a volume using a virtual volume block number (VVBN), according to certain embodiments.

FIG. 15 is a block diagram showing elements of a stale fingerprint manager for identifying and removing stale fingerprint entries when a next deduplication operation is invoked, according to certain embodiments.

FIGS. 16A-16B are diagrams for removing stale fingerprint entries from a fingerprints datastore (FPDS) when a next de-duplication operation is invoked, according to certain embodiments.

FIGS. 17A-17B are diagrams of a verify operation to identify and remove stale fingerprint entries using a primary FPDS and a secondary FPDS, according to certain embodiments.

FIG. 18 is a flow diagram of a method for executing a verify operation (stale fingerprint entry removal) using VVBNs (virtual volume block numbers), according to certain embodiments.

FIG. 19 is a block diagram showing elements of a de-duplication module for executing a verify operation (stale fingerprint entry removal) as a background operation, according to certain embodiments.

FIG. 20 is a flow diagram of a method for executing a verify operation (stale fingerprint entry removal) as a background operation, according to certain embodiments.

FIG. 21 is a flow diagram of a method for computing a fingerprint for a data block, according to certain embodiments.

FIG. 22 is a flow diagram showing a method of sorting a fingerprints datastore, according to certain embodiments.

FIG. 23 is a flow diagram showing the method of freeing a data block, according to certain embodiments.

DETAILED DESCRIPTION

In the following detailed description of embodiments of the invention, reference is made to the accompanying drawings in which like references indicate similar elements, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, mechanical, electrical, functional, and other changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.

Data containers (e.g., files) maintained by a file system generally are made up of individual blocks of data stored on storage devices. Duplication of data blocks may occur when, for example, two or more files have some data in common or where a given set of data occurs at multiple places within a given file. Duplication of data blocks results in inefficient use of storage space. De-duplication eliminates redundant data within a file system. As described herein, de-duplication includes two operations, a ‘de-duplication operation’ to identify and eliminate duplicate data blocks, and a ‘verify operation’ to identify and remove stale entries (e.g., stale records) from a fingerprints datastore. Embodiments of a de-duplication operation and a verify operation are described in greater detail below in conjunction with FIGS. 4A-4C.

One aspect of the de-duplication operation divides a fingerprints datastore into multiple parts, such as a primary fingerprints datastore and a secondary fingerprints datastore. De-duplication operations can use this primary fingerprints datastore, which will be smaller in size compared to a secondary fingerprints datastore, to identify duplicate blocks of data, to reduce the overall time taken to find potential duplicate blocks. Another aspect of the de-duplication operation organizes the fingerprints datastore as a master datastore and stores the entries in the changelogs as datastore segments to avoid overwriting the master datastore during every de-duplication operation. Another aspect of a de-duplication operation references a block in a volume uniquely by maintaining a single entry for each fingerprint in the fingerprints datastore using a VVBN (Virtual Volume Block Number), thus allowing the fingerprints datastore to scale easily.

During a verify operation, stale fingerprint entries are identified and removed from the fingerprints datastore. One aspect of the verify operation optimizes current stale entries removal by reducing the time to sort the fingerprints datastore by recording stale entry information to a separate datastore (e.g., stale entries file), the size of which would be proportional to the number of stale entries in the fingerprints datastore, rather than rewriting the entire fingerprints datastore. In response to detecting a request for a subsequent deduplication operation, the stale entry datastore is merged with the fingerprints datastore and the stale entries are removed prior to the execution of the de-duplication operation when there is a full read/write of the entire fingerprints datastore. Thus, the conventional second sort of the fingerprints datastore is eliminated. Another aspect of a verify operation manages the verify operation as a background operation so that if any de-duplication request is made while a verify operation is executing, the de-duplication request can be served, which in turn helps decrease customer response time.
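
As a rough illustration of the stale entries datastore, the Python sketch below separates the verify operation, which only records stale entry information, from the later removal, which happens in a single merge pass. The tuple layout (fingerprint, inode, fbn), the `is_stale` callback, and both function names are assumptions made for this example, and the fingerprints datastore is assumed to be already sorted in that tuple order.

```python
def verify(fpds, is_stale):
    """Record stale entry information without rewriting the FPDS.

    Returns a sorted list whose size is proportional to the number of
    stale entries, not to the size of the fingerprints datastore.
    """
    return sorted(entry for entry in fpds if is_stale(entry))


def remove_stale(fpds, stale_entries):
    """Drop stale entries in one merge pass over two sorted lists.

    Intended to run when the next de-duplication operation reads and
    rewrites the whole fingerprints datastore anyway.
    """
    result = []
    i = 0
    for entry in fpds:
        # Advance past stale records that sort before the current entry.
        while i < len(stale_entries) and stale_entries[i] < entry:
            i += 1
        if i < len(stale_entries) and stale_entries[i] == entry:
            continue  # entry is stale: leave it out of the rewritten FPDS
        result.append(entry)
    return result
```

In this sketch the only extra sort is over the stale entries themselves, and the fingerprints datastore stays in fingerprint order throughout.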

FIGS. 1A-1B, 2, and 3 show operating environments in which embodiments as described below can be implemented. FIG. 1A shows a network storage system 100 in which embodiments can be implemented. Storage servers 110 (storage servers 110A, 110B) each manage multiple storage units 170 (storage 170A, 170B) that include mass storage devices. These storage servers provide data storage services to one or more clients 102 through a network 130. Network 130 may be, for example, a local area network (LAN), wide area network (WAN), metropolitan area network (MAN), global area network such as the Internet, a Fibre Channel fabric, or any combination of such interconnects. Each of clients 102 may be, for example, a conventional personal computer (PC), server-class computer, workstation, handheld computing or communication device, or other special or general purpose computer.

Storage of data in storage units 170 is managed by storage servers 110 which receive and respond to various read and write requests from clients 102, directed to data stored in or to be stored in storage units 170. Storage units 170 constitute mass storage devices which can include, for example, flash memory, magnetic or optical disks, or tape drives, illustrated as storage devices 171 (171A, 171B). The storage devices 171 can further be organized into arrays (not illustrated) implementing a Redundant Array of Inexpensive Disks/Devices (RAID) scheme, whereby storage servers 110 access storage units 170 using one or more RAID protocols known in the art.

Storage servers 110 can provide file-level service such as used in a network attached storage (NAS) environment, block-level service such as used in a storage area network (SAN) environment, a service which is capable of providing both file-level and block-level service, or any other service capable of providing other data access services. Although storage servers 110 are each illustrated as single units in FIG. 1A, a storage server can, in other embodiments, constitute a separate network element or module (an “N-module”) and disk element or module (a “D-module”). In one embodiment, the D-module includes storage access components for servicing client requests. In contrast, the N-module includes functionality that enables client access to storage access components (e.g., the D-module) and may include protocol components, such as Common Internet File System (CIFS), Network File System (NFS), or an Internet Protocol (IP) module, for facilitating such connectivity. Details of a distributed architecture environment involving D-modules and N-modules are described further below with respect to FIG. 1B, and embodiments of a D-module and an N-module are described further below with respect to FIG. 3.

In yet other embodiments, storage servers 110 are referred to as network storage subsystems. A network storage subsystem provides networked storage services for a specific application or purpose. Examples of such applications include database applications, web applications, Enterprise Resource Planning (ERP) applications, etc., e.g., implemented in a client. Examples of such purposes include file archiving, backup, mirroring, etc., provided, for example, on archive, backup, or secondary storage server connected to a primary storage server. A network storage subsystem can also be implemented with a collection of networked resources provided across multiple storage servers and/or storage units.

In the embodiment of FIG. 1A, one of the storage servers (e.g., storage server 110A) functions as a primary provider of data storage services to client 102. Data storage requests from client 102 are serviced using storage devices 171A organized as one or more storage objects. A secondary storage server (e.g., storage server 110B) takes a standby role in a mirror relationship with the primary storage server, replicating storage objects from the primary storage server to storage objects organized on storage devices of the secondary storage server (e.g., storage devices 170B). In operation, the secondary storage server does not service requests from client 102 until data in the primary storage object becomes inaccessible, such as in a disaster with the primary storage server, such event being considered a failure at the primary storage server. Upon a failure at the primary storage server, requests from client 102 intended for the primary storage object are serviced using replicated data (i.e., the secondary storage object) at the secondary storage server.

It will be appreciated that in other embodiments, network storage system 100 may include more than two storage servers. In these cases, protection relationships may be operative between various storage servers in system 100 such that one or more primary storage objects from storage server 110A may be replicated to a storage server other than storage server 110B (not shown in this figure). Secondary storage objects may further implement protection relationships with other storage objects such that the secondary storage objects are replicated, e.g., to tertiary storage objects, to protect against failures with secondary storage objects. Accordingly, the description of a single-tier protection relationship between primary and secondary storage objects of storage servers 110 should be taken as illustrative only.

FIG. 1B illustrates a block diagram of a distributed or clustered network storage system 120 which may implement embodiments. System 120 may include storage servers implemented as nodes 110 (nodes 110A, 110B) which are each configured to provide access to storage devices 171. In FIG. 1B, nodes 110 are interconnected by a cluster switching fabric 125, which may be embodied as an Ethernet switch.

Nodes 110 may be operative as multiple functional components that cooperate to provide a distributed architecture of system 120. To that end, each node 110 may be organized as a network element or module (N-module 121A, 121B), a disk element or module (D-module 122A, 122B), and a management element or module (M-host 123A, 123B). In one embodiment, each module includes a processor and memory for carrying out respective module operations. For example, N-module 121 may include functionality that enables node 110 to connect to client 102 via network 130 and may include protocol components such as a media access layer, Internet Protocol (IP) layer, Transmission Control Protocol (TCP) layer, User Datagram Protocol (UDP) layer, and other protocols known in the art.

In contrast, D-module 122 may connect to one or more storage devices 171 via cluster switching fabric 125 and may be operative to service access requests on devices 170. In one embodiment, the D-module 122 includes storage access components such as a storage abstraction layer supporting multi-protocol data access (e.g., Common Internet File System protocol, the Network File System protocol, and the Hypertext Transfer Protocol), a storage layer implementing storage protocols (e.g., RAID protocol), and a driver layer implementing storage device protocols (e.g., Small Computer Systems Interface protocol) for carrying out operations in support of storage access operations. In the embodiment shown in FIG. 1B, a storage abstraction layer (e.g., file system) of the D-module divides the physical storage of devices 170 into storage objects. Requests received by node 110 (e.g., via N-module 121) may thus include storage object identifiers to indicate a storage object on which to carry out the request.

Also operative in node 110 is M-host 123 which provides cluster services for node 110 by performing operations in support of a distributed storage system image, for instance, across system 120. M-host 123 provides cluster services by managing a data structure such as a replicated database (RDB) 124 (RDB 124A, RDB 124B) which contains information used by N-module 121 to determine which D-module 122 “owns” (services) each storage object. The various instances of RDB 124 across respective nodes 110 may be updated regularly by M-host 123 using conventional protocols operative between each of the M-hosts (e.g., across network 130) to bring them into synchronization with each other. A client request received by N-module 121 may then be routed to the appropriate D-module 122 for servicing to provide a distributed storage system image.
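
The ownership lookup described above can be pictured with the following Python sketch. The class names, the dictionary standing in for RDB 124, and the string identifiers are all hypothetical; the sketch shows only the routing step and omits the synchronization of RDB instances between M-hosts.

```python
class DModule:
    """Hypothetical stand-in for a D-module that services storage objects."""

    def __init__(self, name):
        self.name = name

    def service(self, object_id, request):
        return f"{self.name} serviced {request} on {object_id}"


class NModule:
    """Hypothetical stand-in for an N-module that routes client requests."""

    def __init__(self, rdb, d_modules):
        self.rdb = rdb              # storage object id -> owning D-module name
        self.d_modules = d_modules  # D-module name -> DModule instance

    def route(self, object_id, request):
        # Consult the local RDB copy to find which D-module owns the object.
        owner = self.rdb[object_id]
        return self.d_modules[owner].service(object_id, request)
```

A client addressing any N-module would see the same result in this model, provided every RDB copy holds the same ownership mapping.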

It should be noted that while FIG. 1B shows an equal number of N- and D-modules constituting a node in the illustrative system, there may be a different number of N- and D-modules constituting a node in accordance with various embodiments. For example, there may be a number of N-modules and D-modules of node 110A that does not reflect a one-to-one correspondence between the N- and D-modules of node 110B. As such, the description of a node comprising one N-module and one D-module for each node should be taken as illustrative only.

FIG. 2 is a block diagram of an embodiment of a storage server, such as storage servers 110A and 110B of FIG. 1A, embodied as a general or special purpose computer including a processor 202, a memory 210, a network adapter 220, a user console 212 and a storage adapter 240 interconnected by a system bus 250, such as a conventional Peripheral Component Interconnect (PCI) bus.

Memory 210 includes storage locations addressable by processor 202, network adapter 220 and storage adapter 240 for storing processor-executable instructions and data structures associated with embodiments. A storage operating system 214, portions of which are typically resident in memory 210 and executed by processor 202, functionally organizes the storage server by invoking operations in support of the storage services provided by the storage server. It will be apparent to those skilled in the art that other processing means may be used for executing instructions and other memory means, including various computer readable media, may be used for storing program instructions pertaining to the embodiments described herein. It will also be apparent that some or all of the functionality of the processor 202 and executable software can be implemented by hardware, such as integrated circuits configured as programmable logic arrays, ASICs, and the like.

Network adapter 220 comprises one or more ports to couple the storage server to one or more clients over point-to-point links or a network. Thus, network adapter 220 includes the mechanical, electrical and signaling circuitry needed to couple the storage server to one or more clients over a network. Each client may communicate with the storage server over the network by exchanging discrete frames or packets of data according to pre-defined protocols, such as TCP/IP.

Storage adapter 240 includes a plurality of ports having input/output (I/O) interface circuitry to couple the storage devices (e.g., disks) to bus 221 over an I/O interconnect arrangement, such as a conventional high-performance, FC or SAS link topology. Storage adapter 240 typically includes a device controller (not illustrated) comprising a processor and a memory for controlling the overall operation of the storage units in accordance with read and write commands received from storage operating system 214. As used herein, data written by a device controller in response to a write command is referred to as “write data,” whereas data read by a device controller responsive to a read command is referred to as “read data.”

User console 212 enables an administrator to interface with the storage server to invoke operations and provide inputs to the storage server using a command line interface (CLI) or a graphical user interface (GUI). In one embodiment, user console 212 is implemented using a monitor and keyboard.

When implemented as a node of a cluster, such as cluster 120 of FIG. 1B, the storage server further includes a cluster access adapter 230 (shown in phantom) having one or more ports to couple the node to other nodes in a cluster. In one embodiment, Ethernet is used as the clustering protocol and interconnect media, although it will be apparent to one of skill in the art that other types of protocols and interconnects can be utilized within the cluster architecture.

FIG. 3 is a block diagram of a storage operating system, such as storage operating system 214 of FIG. 2, that implements embodiments. The storage operating system comprises a series of software layers executed by a processor, such as processor 202 of FIG. 2, and organized to form an integrated network protocol stack or, more generally, a multi-protocol engine 325 that provides data paths for clients to access information stored on the storage server using block and file access protocols.

Multi-protocol engine 325 includes a media access layer 312 of network drivers (e.g., gigabit Ethernet drivers) that interface with network protocol layers, such as the IP layer 314 and its supporting transport mechanisms, the TCP layer 316 and the User Datagram Protocol (UDP) layer 315. A file system protocol layer provides multi-protocol file access and, to that end, includes support for the Direct Access File System (DAFS) protocol 318, the NFS protocol 320, the CIFS protocol 322 and the Hypertext Transfer Protocol (HTTP) protocol 324. A VI layer 326 implements the VI architecture to provide direct access transport (DAT) capabilities, such as RDMA, as required by the DAFS protocol 318. An iSCSI driver layer 328 provides block protocol access over the TCP/IP network protocol layers, while a FC driver layer 330 receives and transmits block access requests and responses to and from the storage server. In certain cases, a Fibre Channel over Ethernet (FCoE) layer (not shown) may also be operative in multi-protocol engine 325 to receive and transmit requests and responses to and from the storage server. The FC and iSCSI drivers provide respective FC- and iSCSI-specific access control to the blocks and, thus, manage exports of luns to either iSCSI or FCP or, alternatively, to both iSCSI and FCP when accessing blocks on the storage server.

The storage operating system also includes a series of software layers organized to form a storage server 365 that provides data paths for accessing information stored on storage devices. Information may include data received from a client, in addition to data accessed by the storage operating system in support of storage server operations such as program application data or other system data. Preferably, client data may be organized as one or more logical storage objects (e.g., volumes) that comprise a collection of storage devices cooperating to define an overall logical arrangement. In one embodiment, the logical arrangement may involve logical volume block number (vbn) spaces, wherein each volume is associated with a unique vbn.

File system 360 implements a virtualization system of the storage operating system through the interaction with one or more virtualization modules (illustrated as a SCSI target module 335). SCSI target module 335 is generally disposed between drivers 328, 330 and file system 360 to provide a translation layer between the block (lun) space and the file system space, where luns are represented as blocks. In one embodiment, file system 360 implements a WAFL (write anywhere file layout) file system having an on-disk format representation that is block-based using, e.g., 4 kilobyte (KB) blocks and using a data structure such as index nodes (“inodes”) to identify files and file attributes (such as creation time, access permissions, size and block location). File system 360 uses files to store metadata describing the layout of its file system, including an inode file, which directly or indirectly references (points to) the underlying data blocks of a file.

Operationally, a request from a client is forwarded as a packet over the network and onto the storage server where it is received at a network adapter. A network driver such as layer 312 or layer 330 processes the packet and, if appropriate, passes it on to a network protocol and file access layer for additional processing prior to forwarding to file system 360. There, file system 360 generates operations to load (retrieve) the requested data from the storage devices if it is not resident “in core”, i.e., in memory 223. If the information is not in memory, file system 360 accesses the inode file to retrieve a logical vbn and passes a message structure including the logical vbn to the RAID system 380. There, the logical vbn is mapped to a disk identifier and device block number (disk, dbn) and sent to an appropriate driver of disk drive system 385. The disk driver accesses the dbn from the specified disk and loads the requested data block(s) in memory for processing by the storage server. Upon completion of the request, the node (and operating system 300) returns a reply to the client over the network.
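
The read path just described can be summarized with the following Python sketch. It is an illustration only: the simple striping formula used to map a logical vbn to a (disk, dbn) pair, the dictionaries standing in for the in-core cache, the inode file, and the disks, and the function names are all assumptions made for this example, and actual RAID mappings differ.

```python
def vbn_to_disk_dbn(vbn, num_disks):
    # Assumption: plain striping, used here only to make the mapping concrete.
    return vbn % num_disks, vbn // num_disks


def read_block(inode, fbn, cache, inode_file, disks):
    """Return the data for block fbn of the given inode.

    cache:      dict (inode, fbn) -> bytes, standing in for in-core data.
    inode_file: dict inode -> {fbn: logical vbn}.
    disks:      list of dicts, each mapping dbn -> bytes.
    """
    key = (inode, fbn)
    if key in cache:
        return cache[key]  # already resident in memory, no device access
    # File system step: look up the logical vbn through the inode file.
    vbn = inode_file[inode][fbn]
    # RAID step: map the logical vbn to a disk identifier and device block number.
    disk, dbn = vbn_to_disk_dbn(vbn, len(disks))
    # Driver step: read the block from the specified disk and load it in memory.
    data = disks[disk][dbn]
    cache[key] = data
    return data
```

A second call for the same (inode, fbn) pair is answered from the cache in this sketch, mirroring the "in core" check made before any device access.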

It should be noted that the software “path” through the storage operating system layers described above needed to perform data storage access for the client request received at the storage server adaptable to the embodiments may alternatively be implemented in hardware. That is, in an alternate embodiment, a storage access request data path may be implemented as logic circuitry embodied within a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC). This type of hardware implementation increases the performance of the storage service provided by the storage server in response to a request issued by a client. Moreover, in another alternate embodiment, the processing elements of adapters 220, 240 may be configured to offload some or all of the packet processing and storage access operations, respectively, from processor 202, to thereby increase the performance of the storage service provided by the storage server. It is expressly contemplated that the various processes, architectures and procedures described herein can be implemented in hardware, firmware or software.

When implemented in a cluster, data access components of the storage operating system may be embodied as D-module 350 for accessing data stored on a storage device (e.g., disk). In contrast, multi-protocol engine 325 may be embodied as N-module 310 to perform protocol termination with respect to a client issuing incoming access requests over the network, as well as to redirect the access requests to any other N-module in the cluster. A cluster services system 336 may further implement an M-host (e.g., M-host 301) to provide cluster services for generating information sharing operations to present a distributed file system image for the cluster. For instance, media access layer 312 may send and receive information packets between the various cluster services systems of the nodes to synchronize the replicated databases in each of the nodes.

In addition, a cluster fabric (CF) interface module 340 (CF interface modules 340A, 340B) may facilitate intra-cluster communication between N-module 310 and D-module 350 using a CF protocol 370. For instance, D-module 350 may expose a CF application programming interface (API) to which N-module 310 (or another D-module not shown) issues calls. To that end, CF interface module 340 can be organized as a CF encoder/decoder using local procedure calls (LPCs) and remote procedure calls (RPCs) to communicate a file system command between D-modules residing on the same node and remote nodes, respectively.

The operating system 300 also includes a user interface module 365 and a de-duplication module 390 logically on top of the file system 360. The user interface module 365 may implement a command line interface and/or a graphical user interface, which may be accessed by a network administrator from an attached administrative console or through a network. The de-duplication module 390 is an application layer which identifies and eliminates duplicate data blocks (“de-duplication”) and triggers data block sharing in accordance with the embodiments introduced herein.

The operating system 300 also includes, or has access to, data repositories that are used to implement the data block sharing. The data repositories can include, but are not limited to, a fingerprints datastore, a changelog file, an active map, and a reference count file. Embodiments of the data repositories are described in greater detail below in conjunction with FIGS. 4A-4C and FIG. 23. Although embodiments are shown within the storage operating system, it will be appreciated that embodiments may be implemented in other modules or components of the storage server. In addition, embodiments may be implemented as one or a combination of a software-executing processor, hardware or firmware within the storage server. As such, embodiments may directly or indirectly interface with modules of the storage operating system.

As used herein, the term “storage operating system” generally refers to the computer-executable code operable on a computer to perform a storage function that manages data access and may implement data access semantics of a general purpose operating system. The storage operating system can also be implemented as a microkernel, an application program operating over a general-purpose operating system, such as UNIX® or Windows XP®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.

In addition, it will be understood by those skilled in the art that the embodiments described herein may apply to any type of special-purpose (e.g., file server or storage serving appliance) or general-purpose computer, including a standalone computer or portion thereof, embodied as or including a storage system. Moreover, the embodiments can be adapted to a variety of storage system architectures including, but not limited to, a network-attached storage environment, a storage area network and disk assembly directly-attached to a client or host computer. The term “storage system” should therefore be taken broadly to include such arrangements in addition to any subsystems configured to perform a storage function and associated with other equipment or systems. It should be noted that while this description is written in terms of a write anywhere file system, the embodiments may be utilized with any suitable file system, including conventional write in place file systems.

FIGS. 4A-4B show block diagrams of storage environments in which embodiments can be implemented. De-duplication can occur in-line and offline. When de-duplication occurs while data is being written to a file system, the de-duplication can be referred to as ‘in-line de-duplication.’ When de-duplication occurs after data is written to a storage device (e.g., disk), the de-duplication can be referred to as ‘offline de-duplication.’ A storage server 450 is coupled to a storage device 461 in storage 460 and the storage device 461 stores blocks of data. The storage server 450 includes a de-duplication module 451 that generates a fingerprint for each data block stored on the storage device 461. The storage server 450 is coupled to a fingerprints datastore (e.g., a fingerprints database), which stores entries for every block within a file system. An entry includes the fingerprint (e.g., checksum) for the data block.

When de-duplication runs for the first time, the de-duplication module 451 scans the blocks and creates a fingerprints datastore, which contains fingerprints for used blocks in the storage device 461. The fingerprints datastore can store an entry (e.g., a fingerprint record) for each data block that is written to the storage device 461 in the storage 460. An entry includes a fingerprint (fingerprint value) for the data block. A “fingerprint” or “fingerprint value” may be a checksum, for example. The fingerprints are used in a de-duplication operation for efficiently identifying duplicate data blocks, i.e., to identify data blocks that can be shared. A de-duplication operation is described below in detail, according to embodiments.

When new data blocks are written or updated in the file system, new fingerprint entries are created and logged into a changelog 463. During a de-duplication operation, the entries in the fingerprints datastore are compared to the entries in the changelog 463 to identify and free duplicate data blocks so as to leave only one instance of each unique data block in the file system.

In one embodiment, the fingerprints datastore is divided into a primary datastore 453 and a secondary datastore 457. The primary datastore 453 includes a single entry 455 for each unique fingerprint and the secondary datastore 457 includes an entry 459 having an identical fingerprint as an entry in the primary datastore 453. The de-duplication module 451 merges entries in a changelog 463 with the entries 455 in the primary datastore 453 to identify duplicate data blocks in the storage device 461 and frees the identified duplicate data blocks in the storage. The de-duplication module 451 stores the entries that correspond to the freed data blocks to a third datastore and overwrites the primary datastore 453 with the entries from the merged data that correspond to the unique fingerprints to create an updated primary datastore.

Typically, during a de-duplication operation, the fingerprints datastore is overwritten each time with the entries from the current fingerprints datastore and the changelog 463. Overwriting the entire fingerprints datastore with every de-duplication operation, however, can involve a large amount of write cost. FIG. 4B shows another embodiment of a fingerprints datastore that is organized into a master datastore 473 and one or more datastore segments 477 to avoid overwriting the master datastore 473 during every de-duplication operation. The master datastore 473 includes an entry 475 for each data block that is written to the storage device 481. Each data block has a fingerprint that is generated by the de-duplication module 471. The one or more datastore segments 477 include an entry 479 for a new data block or modified data block that is subsequently written to the storage device 481. Each new and modified data block has a fingerprint that is generated by the de-duplication module 471. The de-duplication module 471 delays overwriting the master datastore 473 until a segment count threshold is reached or until a verify operation is triggered. The de-duplication module 471 can then overwrite the master datastore 473 with the entries 479 in all of the datastore segments 477 and the entries 475 in the master datastore 473 to create an updated master datastore.
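The deferred overwrite of the master datastore can be illustrated with a minimal sketch. The class name, the tuple layout (fingerprint, inode, fbn), and the rewrite counter below are assumptions made for illustration only and are not taken from the embodiments; the sketch merely shows new entries accumulating in small sorted segments until a segment count threshold is reached.

```python
import heapq


class SegmentedFpds:
    """Toy master datastore plus datastore segments (all names hypothetical)."""

    def __init__(self, master_entries, segment_count_threshold):
        self.master = sorted(master_entries)
        self.segments = []
        self.segment_count_threshold = segment_count_threshold
        self.master_rewrites = 0  # counts the expensive full overwrites

    def add_segment(self, new_entries):
        # New or modified blocks land in a small sorted segment, not the master.
        self.segments.append(sorted(new_entries))
        if len(self.segments) >= self.segment_count_threshold:
            self.rewrite_master()

    def rewrite_master(self):
        # A triggered verify operation could call this method directly as well.
        self.master = list(heapq.merge(self.master, *self.segments))
        self.segments = []
        self.master_rewrites += 1


fpds = SegmentedFpds([("aa", 1, 0), ("cc", 1, 1)], segment_count_threshold=3)
fpds.add_segment([("bb", 2, 0)])
fpds.add_segment([("dd", 2, 1)])
rewrites_before = fpds.master_rewrites  # master not yet overwritten
fpds.add_segment([("ab", 3, 0)])        # third segment reaches the threshold
```

In this sketch the master is rewritten once for three rounds of new entries rather than once per round, which is the write-cost saving the segmented organization is aiming for.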

FIG. 4C shows a high-level de-duplication method, according to certain embodiments. Method 400 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device), or a combination thereof. In one embodiment, method 400 is performed by a de-duplication module (e.g., de-duplication module 390 in FIG. 3) hosted by storage servers 110 of FIG. 1A.

The first phase of a de-duplication process includes identifying and eliminating duplicate data blocks. The identifying and eliminating of duplicate data blocks is hereinafter referred to as a ‘de-duplication operation’ or ‘block freeing operation.’ At instruction block 401, the method identifies duplicate data blocks. One embodiment of a method for identifying duplicate data blocks using a fingerprints datastore that is divided into multiple parts, such as a primary datastore and a secondary datastore, is described in greater detail in conjunction with FIGS. 9A-9B. Another embodiment of a method for identifying duplicate data blocks using a fingerprints datastore that is organized into a master datastore and datastore segments is described in greater detail in conjunction with FIGS. 11A-11B. Another embodiment of a method for identifying duplicate data blocks using VVBNs (virtual volume block numbers) is described in greater detail in conjunction with FIGS. 14A-14B.

Once the duplicate data blocks are identified, the method eliminates the identified duplicate blocks (e.g., actual duplicate data blocks) at instruction block 403 so as to leave only one instance of each unique data block. Eliminating the duplicate data blocks includes sharing the remaining instance of each data block that was duplicated and freeing the (no longer used) duplicate data block(s). One embodiment of a method for eliminating a data block, such as a duplicate block, is described in greater detail in conjunction with FIG. 23. At instruction block 403, the method also updates a reference count file and an active map. An embodiment of updating a reference count file and an active map is described in greater detail below in conjunction with FIG. 23.

The fingerprint entries that correspond to the eliminated duplicate data blocks and remain in a fingerprints datastore (FPDS) are referred to as ‘stale’ fingerprint entries. A verify operation identifies stale fingerprint entries in the FPDS. The identifying and removing of stale fingerprint entries is hereinafter referred to as a ‘verify operation,’ ‘stale record removal operation,’ ‘verify phase,’ ‘verify scan,’ or ‘checking phase.’ In one embodiment, a verify operation identifies stale fingerprint entries and the stale entries are removed from a fingerprints datastore during a subsequent de-duplication operation. A stale fingerprint entry is an entry that has a fingerprint that corresponds to a block that has been deleted or overwritten, for example, at instruction block 403.

At instruction block 405, the method determines whether to perform a verify operation. A verify operation can be automatically triggered when the number of stale entries in a fingerprints datastore reaches or exceeds a stale entries threshold. In another example, a verify operation can be triggered from a CLI (command line interface). A verify operation can also be user driven, for example, by the de-duplication module receiving instructions entered by a user via a command line interface.

If there is a trigger for the verify operation, the method identifies and removes the stale fingerprint entries from the fingerprints datastore at instruction block 407. One embodiment of a method for identifying and removing stale fingerprint entries when a next de-duplication operation is invoked is described in greater detail in conjunction with FIGS. 16A-16B. Another embodiment of a method for identifying and removing stale fingerprint entries as a background job is described in greater detail in conjunction with FIG. 24.

If there is not a trigger for the verify operation, the method determines whether a de-duplication operation start request (e.g., ‘sis start’ command or ‘SIS request’) is received at instruction block 409. New data blocks may be written or updated in storage (e.g., storage 170A,B in FIG. 1A) and can trigger the storage server (e.g., a de-duplication module in a storage server) to compute fingerprints of the new data blocks and update the fingerprints datastore to reflect the new data blocks. When new data blocks are written or updated in a file system, the storage server creates and logs new fingerprint entries into a changelog file. The method may detect a SIS request to perform de-duplication on the updated fingerprints datastore. If a SIS request is detected, the method returns to instruction block 401 to identify duplicate blocks in the fingerprints datastore.

The method 400 can be triggered automatically at predetermined intervals or at predetermined times, or it may be triggered manually or in response to pre-specified events (such as deletion of a file) or in response to a pre-specified policy (such as a given number of new blocks having been collected).

FIGS. 5A, 5B, and 5C are block diagrams illustrating data sharing, according to certain embodiments. Assume for purposes of explanation that the active file system of a file server maintains two simple files, named Foo and Bar, shown in FIG. 5A. File Foo contains two data blocks, and file Bar contains two data blocks. Each data block is identified in the file system by (among other things) its volume block number (VBN). A VBN identifies the logical block where the data is stored (since RAID aggregates multiple physical drives as one logical drive), as opposed to a physical block. A VBN should be distinguished from a disk block number (DBN) which identifies the physical block number within a disk in which the block is stored, or a file block number (FBN) which identifies the logical position of the data within a file. The two blocks of file Foo have VBN values of 1 and 2. VBN 1 contains the data, “A”, while VBN 2 contains the data, “B”. The two blocks of file Bar have VBN values of 3 and 4. VBN 3 contains the data, “C”, while VBN 4 contains the data, “D”.

For each VBN maintained by the file system, a reference count file includes a value, REFCOUNT, indicating the number of references to that VBN. The reference count file contains an entry (e.g., record) for each data block maintained by the storage server, wherein each entry includes a value, REFCOUNT, indicating the number of references to that data block. For example, a data block which is shared by two files would have a REFCOUNT value of 2. A data block can be shared by more than two files (or other entities), in which case the REFCOUNT value would reflect this accordingly. A data block which is allocated but not shared would have a REFCOUNT value of 1. A data block which is not yet allocated would have a REFCOUNT value of 0. In certain embodiments, the REFCOUNT value for each data block is a two-byte binary value, which allows each data block to be the target of up to 2¹⁶−1 references. In the example of FIG. 5A, for VBNs [1,2,3,4] the REFCOUNT values are [1,1,1,1], respectively, indicating that each VBN is the target of one reference.

Refer now to FIG. 5B, which is a variation of the example of FIG. 5A, in which VBNs 3 and 4 of file Bar have the same data (“A”) as VBN 1 of file Foo. That is, VBNs 3 and 4 are duplicates of VBN 1 and of each other. Initially, when a data block is allocated by the file system, its REFCOUNT value in the reference count file is set equal to 1. Accordingly, before duplicate data blocks are identified in the example of FIG. 5B, the REFCOUNT values for the example of FIG. 5B are the same as in FIG. 5A, i.e., [1,1,1,1], as shown.

In contrast, FIG. 5C shows what the example of FIG. 5B would look like after duplicate data blocks have been identified (e.g., at instruction block 401 in method 400 of FIG. 4C) and sharing is implemented (e.g., at instruction block 403 in method 400 of FIG. 4C). Sharing involves giving, to each entity which owns a shareable data block, a pointer to that data block. Accordingly, in the example of FIG. 5C this involves giving file Bar two pointers to VBN 1 (file Foo already had a pointer to VBN 1). Data sharing also involves eliminating the duplicate data blocks, VBNs 3 and 4, and freeing them for reuse (e.g., at instruction block 403 of FIG. 4C). Once the data sharing is completed, the REFCOUNT values for VBNs [1,2,3,4] are adjusted to be [3,1,0,0], respectively, to reflect the fact that VBNs 3 and 4 have been freed and VBN 1 now has three references to it (i.e., VBN 1 is shared). One embodiment of a method for freeing duplicate data blocks for reuse is described in greater detail in conjunction with FIG. 23.

Returning to FIG. 5C, the data sharing continually updates the reference count file to reflect events that affect these blocks. For example, if file Foo is now deleted, the REFCOUNT values for VBNs [1,2,3,4] would be adjusted to be [2,0,0,0], respectively, reflecting that VBN 2 has been freed in addition to VBNs 3 and 4. Note that VBN 1 has not been freed (i.e., its REFCOUNT value is not zero), since VBN 1 is still in use by file Bar; instead the REFCOUNT value for VBN 1 has been decremented from 3 to 2. If file Bar is now deleted, the REFCOUNT values for VBNs [1,2,3,4] would be adjusted to be [0,0,0,0], respectively.
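The REFCOUNT sequence of FIGS. 5B and 5C can be traced with a short sketch. The Volume class and its method names are hypothetical and stand in for the file system and the reference count file; only the file names, the VBNs, and the expected REFCOUNT values come from the example above.

```python
MAX_REFCOUNT = 2**16 - 1  # largest value a two-byte REFCOUNT can hold


class Volume:
    """Toy model: a file is a list of VBN pointers; refcount maps VBN to REFCOUNT."""

    def __init__(self, vbns):
        self.refcount = {vbn: 0 for vbn in vbns}
        self.files = {}

    def _incref(self, vbn):
        if self.refcount[vbn] >= MAX_REFCOUNT:
            raise OverflowError("REFCOUNT would exceed two bytes")
        self.refcount[vbn] += 1

    def _decref(self, vbn):
        assert self.refcount[vbn] > 0, "block is already free"
        self.refcount[vbn] -= 1

    def create_file(self, name, vbns):
        self.files[name] = list(vbns)
        for vbn in vbns:
            self._incref(vbn)

    def share(self, duplicate_vbn, kept_vbn):
        # Repoint every reference to the duplicate at the kept block.
        for pointers in self.files.values():
            for i, vbn in enumerate(pointers):
                if vbn == duplicate_vbn:
                    pointers[i] = kept_vbn
                    self._incref(kept_vbn)
                    self._decref(duplicate_vbn)

    def delete_file(self, name):
        for vbn in self.files.pop(name):
            self._decref(vbn)

    def counts(self):
        return [self.refcount[vbn] for vbn in sorted(self.refcount)]


vol = Volume([1, 2, 3, 4])
vol.create_file("Foo", [1, 2])
vol.create_file("Bar", [3, 4])   # FIG. 5B: VBNs 3 and 4 duplicate VBN 1
before_sharing = vol.counts()
vol.share(3, 1)
vol.share(4, 1)
after_sharing = vol.counts()     # FIG. 5C
vol.delete_file("Foo")
after_foo_deleted = vol.counts()
vol.delete_file("Bar")
after_bar_deleted = vol.counts()
```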

In one embodiment, the data sharing uses a file system that adheres to the copy-on-write principle; that is, anytime a data block is modified, it is written to a different VBN, rather than modifying the data in place. Referring back to the example of FIG. 5C, therefore, assume that a write request from a client causes the data “A” in file Bar to be changed to “F”. In this case, VBN 1, which contains the data “A”, is not modified. However, since the new data, “F”, is written to a new logical and physical block, the REFCOUNT value for VBN 1 must still be updated. Hence, the REFCOUNT value for VBN 1 in this case would be decremented by one. In addition, the REFCOUNT value for whichever VBN is allocated to store the new data, “F”, would be incremented by one.

In another embodiment, data sharing uses a file system which does not impose copy-on-write in all instances. For example, the data sharing can be implemented by requiring copy-on-write only when the REFCOUNT value for a given data block is greater than one.
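Both write policies can be sketched in a few lines. The function name, the always_cow flag, and the free-list allocator are assumptions for illustration; the starting state is the FIG. 5C example, in which file Bar holds two pointers to VBN 1.

```python
def write_block(refcount, pointers, index, free_vbns, always_cow=True):
    """Apply a write to pointers[index] and return the VBN that holds the new data."""
    old_vbn = pointers[index]
    if not always_cow and refcount[old_vbn] == 1:
        # Unshared block under the relaxed policy: modify in place.
        return old_vbn
    new_vbn = free_vbns.pop(0)   # hypothetical allocator: take the first free VBN
    refcount[old_vbn] -= 1       # the writer no longer references the old block
    refcount[new_vbn] += 1       # the new block gains its first reference
    pointers[index] = new_vbn
    return new_vbn


# Strict copy-on-write: file Bar changes one copy of "A" to "F".
refcount = {1: 3, 2: 1, 3: 0, 4: 0}
bar = [1, 1]
cow_vbn = write_block(refcount, bar, 0, [3, 4])

# Relaxed policy: file Foo rewrites its unshared block at VBN 2 in place.
relaxed_refcount = {1: 3, 2: 1, 3: 0, 4: 0}
foo = [1, 2]
in_place_vbn = write_block(relaxed_refcount, foo, 1, [3, 4], always_cow=False)
```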

To avoid data inconsistencies, when a data container (e.g., file) which contains one or more shared blocks is modified, its REFCOUNT values and block pointers are updated in a single atomic transaction. This updating may be done, for example, during a “consistency point”, i.e., when a set of accumulated write transactions are committed from temporary storage to persistent storage.

The data in the reference count file may become corrupted, for any of various reasons. The storage server scans the entire active file system for consistency with the reference count file before boot-up of the file system to ensure the consistency between the reference count file and the actual state of the file system, according to one embodiment. The scanning can include creating a separate, temporary reference count file in main memory of the file server, scanning all data blocks in the file system to identify shared data blocks, and updating the temporary reference count file to reflect any shared data blocks. The temporary reference count file is then compared to the regular (persistent, on-disk) reference count file to determine whether they match. If they do not match, an inconsistency is identified, and appropriate corrective action is taken.
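A minimal sketch of such a consistency check follows. It assumes, purely for illustration, that the file system can be walked as a mapping from file name to a list of VBN pointers and that the persistent reference count file is a mapping from VBN to REFCOUNT; the function names are hypothetical.

```python
from collections import Counter


def rebuild_refcounts(files):
    """Build a temporary reference count by scanning every block pointer."""
    temporary = Counter()
    for pointers in files.values():
        temporary.update(pointers)
    return temporary


def find_inconsistencies(files, persistent):
    """Return {vbn: (persistent_value, scanned_value)} for every mismatch."""
    temporary = rebuild_refcounts(files)
    mismatches = {}
    for vbn in set(temporary) | set(persistent):
        stored = persistent.get(vbn, 0)
        scanned = temporary.get(vbn, 0)
        if stored != scanned:
            mismatches[vbn] = (stored, scanned)
    return mismatches


files = {"Foo": [1, 2], "Bar": [1, 1]}  # the state shown in FIG. 5C
clean = find_inconsistencies(files, {1: 3, 2: 1, 3: 0, 4: 0})
corrupted = find_inconsistencies(files, {1: 2, 2: 1, 3: 0, 4: 0})
```

The corrective action itself is left out of the sketch because the text does not specify it.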

In another embodiment, the consistency check is run while the file system is in operation. The storage server creates a temporary reference count file on a storage device (e.g., disk), so as not to consume main memory in the storage server. In that case, however, if the user modifies a particular block while the consistency check is running, it is necessary to update both the temporary and the persistent reference count files.

Various other optimizations can be added to the above described data sharing. For example, a SHARED flag can be provided for each data container (e.g., file) in the file system, to indicate whether the file contains any shared blocks. The SHARED flag can be stored in a convenient location, such as in the file's inode (a container of metadata about the file, used by the file system), to allow fast determination of whether it is necessary to read the reference count file when modifying a block. This avoids unnecessarily having to read the (large) reference count file when the file includes no shared blocks. Similarly, another flag can be implemented for each volume in the storage system, to indicate whether the volume is allowed to implement block sharing. The benefit, as in the previous example, is avoiding the need to read the reference count file in all cases.
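The effect of the two flags can be sketched as a short-circuit placed in front of the reference count lookup. The RefcountFile class, the dictionary used as an inode, and the read counter are hypothetical and exist only to make the avoided reads visible.

```python
class RefcountFile:
    """Stand-in for the (large) on-disk reference count file."""

    def __init__(self, counts):
        self._counts = dict(counts)
        self.reads = 0  # number of lookups that actually touched the file

    def get(self, vbn):
        self.reads += 1
        return self._counts.get(vbn, 0)


def block_is_shared(inode, vbn, refcount_file, volume_sharing_enabled=True):
    # Either flag being clear means no block of this file can be shared,
    # so the reference count file does not need to be read at all.
    if not volume_sharing_enabled or not inode["shared"]:
        return False
    return refcount_file.get(vbn) > 1


refcount_file = RefcountFile({1: 3, 2: 1})
unshared_inode = {"shared": False}
shared_inode = {"shared": True}

first = block_is_shared(unshared_inode, 2, refcount_file)
reads_after_first = refcount_file.reads
second = block_is_shared(shared_inode, 1, refcount_file)
reads_after_second = refcount_file.reads
```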

Further, one or more counters can be implemented in the file system to track the total number of shared blocks. These counters can be used to provide an output to a user (e.g., a storage network administrator) indicating the amount of storage device space (e.g., disk space) being saved as a result of block sharing.

According to certain embodiments, at any particular point in time a block will be in one of the following states: free, in-use, fingerprinted, shared, or zombie. A free block is a block that is not being used (not allocated). An in-use block is a block that is being used and has not yet been processed by the de-duplication operation. A fingerprinted block is a block that has been processed by the de-duplication operation, and for which an entry (e.g., record) has been added into the fingerprints datastore to track the block. A shared block is a block that has become shared and for which one or more duplicates of this block have been identified and eliminated. A zombie is a block that was shared but now is no longer used by any files, but the block has not yet been freed. FIG. 6 illustrates how a block can transition through the various states in response to various events, according to certain embodiments.

FIG. 7 illustrates the elements of a de-duplication module 700 (e.g., de-duplication module 390 in FIG. 3) coupled to a primary FPDS 760 and a secondary FPDS 765, according to certain embodiments. The elements include a de-duplication engine 710, a gatherer module 720, a fingerprint manager 730, a fingerprint handler 740, a block sharing engine 750, and a stale fingerprint manager 780. The de-duplication module 700 can be coupled to a fingerprints datastore that stores fingerprints of data blocks that have been written to storage (e.g., storage 170A,B in FIG. 1A).

Conventional de-duplication solutions include a single, significantly large, FPDS. Therefore, during a de-duplication operation when the FPDS is sorted and merged with a sorted changelog to identify potential duplicate data blocks, there is a significant time taken to sort the large FPDS. According to certain embodiments, a FPDS is improved by dividing it into more than one datastore (e.g., a primary datastore and a secondary datastore) to reduce the overall time taken to identify potential duplicate blocks. Another embodiment of an improved FPDS includes a FPDS that is organized into segments as described in conjunction with FIGS. 11A-11B.

When de-duplication runs for the first time on a flexible volume with existing data, the de-duplication module 700 scans the blocks in the flexible volume and creates a fingerprints datastore (FPDS), which contains a sorted list of all fingerprints for used blocks in the flexible volume, according to some embodiments. The FPDS can store an entry (e.g., fingerprint record) for each data block that is written to the storage.

In one embodiment, the fingerprint manager 730 divides and manages the fingerprints datastore as multiple datastores, such as a primary fingerprints datastore 760 and a secondary fingerprints datastore 765. A primary FPDS 760 contains an entry (e.g., fingerprint record) for each unique fingerprint value. A secondary FPDS 765 contains fingerprint entries that have the same fingerprint value as an entry (e.g., record) in the primary FPDS 760. Unlike conventional de-duplication solutions, the sorted primary FPDS 760 is significantly smaller because it stores entries for only unique fingerprints, and the entries in this smaller datastore are merged with the entries in a changelog to reduce the overall time taken to identify potential duplicate blocks.
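The division into a primary and a secondary datastore can be sketched as follows. The sketch assumes entries are (fingerprint, inode, fbn) tuples and simply keeps the first entry of each fingerprint group in the primary datastore; the byte-level confirmation of actual duplicates is omitted, and the function name is hypothetical.

```python
from itertools import groupby


def split_fpds(entries):
    """Split entries into (primary, secondary) lists, both sorted by fingerprint."""
    def by_fingerprint(entry):
        return entry[0]

    primary, secondary = [], []
    ordered = sorted(entries, key=by_fingerprint)  # stable sort keeps input order
    for _, group in groupby(ordered, key=by_fingerprint):
        first, *rest = group
        primary.append(first)    # exactly one entry per unique fingerprint
        secondary.extend(rest)   # entries repeating a primary fingerprint
    return primary, secondary


entries = [("bb", 1, 1), ("aa", 1, 0), ("aa", 2, 0), ("aa", 2, 5)]
primary, secondary = split_fpds(entries)
```

Only the primary list would then need to be merged with a changelog, which is where the time saving described above comes from.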

The de-duplication engine 710 schedules and triggers operations of the other modules. In particular, the de-duplication engine 710 triggers operation of the gatherer module 720, which may be done according to a pre-specified schedule, timing algorithm, or in response to a manual input. The de-duplication engine 710 can detect a request to start a de-duplication operation (e.g., sis start command) and start a de-duplication operation. For example, the de-duplication engine 710 invokes the gatherer module 720. In one embodiment where a verify operation is currently in progress, the de-duplication engine 710 detects a de-duplication start request and notifies the stale fingerprint manager 780. One embodiment for invoking a de-duplication operation and performing a verify operation in the background is described in detail in conjunction with FIGS. 19-20.

When de-duplication runs for the first time, the gatherer module 720 identifies each data block that has been written to storage (e.g., storage 170A,B in FIG. 1A) and triggers the fingerprint handler 740 to compute fingerprints for the data blocks and return them to the gatherer module 720. In one embodiment, the gatherer module 720 operates as an initial scanner and also as a scanner for subsequent de-duplication operations. In another embodiment, the gatherer module 720 operates as an initial scanner and a de-duplication scanner (not shown) identifies new data blocks that are written to storage in subsequent de-duplication operations. The fingerprint handler 740 is responsible for computing the fingerprints of data blocks. In certain embodiments, the fingerprint handler 740 calculates a checksum, such as an MD5 checksum, to compute a fingerprint. One embodiment of a method for computing the fingerprints of data blocks is described in greater detail in conjunction with FIG. 21.
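As a minimal sketch of a checksum-based fingerprint, the following uses the MD5 implementation in Python's standard hashlib module over a 4 KB block. The function name and the choice of a hexadecimal digest are assumptions for illustration; the method of FIG. 21 may differ.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 KB blocks, as in the block-based format described earlier


def fingerprint(block: bytes) -> str:
    """Return the MD5 checksum of one data block as a hex string."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("expected exactly one 4 KB block")
    return hashlib.md5(block).hexdigest()


block_a = b"A" * BLOCK_SIZE
block_a_copy = b"A" * BLOCK_SIZE
block_b = b"B" * BLOCK_SIZE

same = fingerprint(block_a) == fingerprint(block_a_copy)
different = fingerprint(block_a) != fingerprint(block_b)
```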

The fingerprint manager 730 receives the fingerprints of the data blocks that have been written to storage (e.g., storage 170A,B in FIG. 1A) from the gatherer module 720 and stores an entry (e.g., a fingerprint record) for each data block that is written to the storage in a FPDS. The fingerprint manager 730 divides and manages the fingerprints datastore as multiple datastores, such as a primary fingerprints datastore 760 and a secondary fingerprints datastore 765. One embodiment for dividing a FPDS into a primary datastore and a secondary datastore is described in greater detail in conjunction with FIGS. 8A-8B. The fingerprint manager 730 creates an entry for each data block that has been written to storage. In one embodiment, an entry can include, and is not limited to, the fingerprint of the block, the inode number of the file to which the block belongs, and the FBN (file block number) of the block. In one embodiment, the fingerprint manager 730 sorts the entries in the FPDS (e.g., primary FPDS 760, secondary FPDS 765) by fingerprint value. One embodiment of a sorting process is described in greater detail in conjunction with FIG. 22.

The gatherer module 720 also identifies new data blocks that are written or updated in storage and triggers the fingerprint handler 740 to compute fingerprints of the new data blocks and return them to the gatherer module 720. In certain embodiments, the fingerprint manager 730 also maintains a changelog file (e.g., changelog 770) that is coupled to the de-duplication module 700 for identifying blocks that are new or modified since the last time the process of FIG. 4C was executed.

When new data blocks are written or updated in a file system, the fingerprint manager 730 logs an entry (e.g., fingerprint record) into the changelog 770. In one embodiment, the changelog 770 contains information of the same type as the fingerprints datastore (e.g., primary FPDS 760, secondary FPDS 765) (i.e., fingerprint of the block, inode number of the file to which the block belongs, and the FBN of the block), but only for new or modified blocks. In one embodiment, the fingerprint manager 730 sorts the entries in the changelog 770, for example, by fingerprint value.

The fingerprint manager 730 detects subsequent de-duplication start requests (e.g., sis start commands) and re-executes a sorting process on the entries in a FPDS (e.g., primary FPDS 760) and the changelog 770 by fingerprint value. The fingerprint manager 730 merges entries in a sorted FPDS (e.g., primary FPDS 760) with entries in a sorted changelog 770 to identify potentially duplicate data blocks, by finding entries with matching fingerprints.
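The sort-and-merge step can be sketched with standard library tools. The Entry record mirrors the fields named above (fingerprint, inode number, FBN); the function name and the use of heapq.merge and itertools.groupby are illustrative choices, not the sorting process of FIG. 22.

```python
import heapq
from itertools import groupby
from typing import NamedTuple


class Entry(NamedTuple):
    fingerprint: str
    inode: int
    fbn: int


def potential_duplicates(fpds, changelog):
    """Return groups of entries whose fingerprints match after a sorted merge."""
    def by_fingerprint(entry):
        return entry.fingerprint

    merged = heapq.merge(
        sorted(fpds, key=by_fingerprint),
        sorted(changelog, key=by_fingerprint),
        key=by_fingerprint,
    )
    groups = []
    for _, group in groupby(merged, key=by_fingerprint):
        members = list(group)
        if len(members) > 1:  # a matching fingerprint marks potential duplicates
            groups.append(members)
    return groups


fpds = [Entry("bb", 1, 1), Entry("aa", 1, 0)]
changelog = [Entry("cc", 2, 1), Entry("aa", 2, 0)]
groups = potential_duplicates(fpds, changelog)
```

Because both inputs are already sorted, the merge is a single linear pass, and matching fingerprints end up adjacent to one another.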

The block sharing engine 750 is responsible for comparing potentially duplicate data blocks identified by the fingerprint manager 730 to each other to identify actual duplicate data blocks. The blocks represented by any entries which have identical fingerprints are considered to be potential duplicate blocks, rather than actual duplicates, since there is always a possibility that two non-identical blocks could have the same fingerprint, regardless of the fingerprint scheme being used. The block sharing engine 750 also eliminates the actual duplicate data blocks and implements block sharing by calling functions of a file system (e.g., file system 360 in FIG. 3). Eliminating the duplicate data blocks can include sharing the remaining instance of each data block that was duplicated and freeing the (no longer used) duplicate data block(s). For performance reasons, multiple block share operations may be ongoing at any given time. One embodiment of a method for eliminating a data block, such as a duplicate block, is described in greater detail in conjunction with FIG. 23.
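The distinction between potential and actual duplicates can be sketched as a byte-for-byte comparison among blocks that share a fingerprint. The function name and the read_block callback are hypothetical; the sketch assumes the caller supplies a way to read a block's contents by its identifier.

```python
def actual_duplicates(candidates, read_block):
    """Return (kept_block, duplicate_block) pairs among same-fingerprint candidates."""
    kept_blocks = []
    pairs = []
    for block_id in candidates:
        data = read_block(block_id)
        for kept in kept_blocks:
            if read_block(kept) == data:   # full comparison, not just fingerprints
                pairs.append((kept, block_id))
                break
        else:
            kept_blocks.append(block_id)   # distinct contents: a fingerprint collision
    return pairs


# VBN 9 is pretended to collide with the others on fingerprint but holds other data.
blocks = {1: b"A", 3: b"A", 4: b"A", 9: b"X"}
pairs = actual_duplicates([1, 3, 4, 9], blocks.__getitem__)
```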

In some embodiments, the fingerprint manager 730 writes the fingerprint entries that correspond to the eliminated actual duplicate data blocks to a third datastore (e.g., file) and overwrites the primary FPDS 760 with the fingerprint entries that correspond to the unique data blocks to create an updated primary FPDS. The updated primary FPDS can be used for a verify operation to identify and remove stale fingerprint entries that correspond to eliminated data blocks and/or for a subsequent de-duplication operation.

The stale fingerprint manager 780 performs a verify operation to identify and remove ‘stale’ fingerprint entries from a fingerprints datastore (e.g., primary datastore 760), according to one embodiment. In another embodiment, the stale fingerprint manager 780 removes stale fingerprints during a subsequent de-duplication operation. A stale entry (e.g., stale fingerprint record) is an entry that has a fingerprint that corresponds to a data block that has been eliminated (deleted or overwritten) by the block sharing engine 750. The fingerprint manager 730 saves context information in the fingerprint entries (e.g., entries in the primary FPDS 760, secondary FPDS 765, and any changelogs 770) for each block, such as the value of a consistency point counter at the time the block was written to a storage device (e.g., disk). The stale fingerprint manager 780 uses the context information to detect and delete stale fingerprint entries from the fingerprints datastore (e.g., primary datastore 760). Entries having higher consistency point counter values are more recent than entries with lower consistency point counter values. In one embodiment, the stale fingerprint manager 780 identifies fingerprint entries having the same <inode, fbn> as other entries, but with lower consistency point counter values compared to the other entries, as stale fingerprint entries. In another embodiment, the stale fingerprint manager 780 identifies fingerprint entries having the same <vvbn> as other entries, but with lower consistency point counter values compared to the other entries, as stale fingerprint entries. The entries that are not identified as stale are stale-free fingerprint entries. Complementary to this functionality, information on the deleted files and/or blocks in the deletion code path is also logged and used to clean up stale entries. Embodiments for identifying and removing stale fingerprint entries are described in greater detail in conjunction with FIGS. 15-20.
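By way of illustration only, the consistency point comparison described above can be sketched as follows. The `Entry` record, its field names, and the `split_stale` helper are hypothetical names chosen for this sketch and do not appear in the described embodiments; the sketch assumes the <inode, fbn> embodiment and treats entries with equal counter values as not stale.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    # Hypothetical record layout, for illustration only.
    fingerprint: str
    inode: int
    fbn: int
    cp_count: int  # consistency point counter when the block was written

def split_stale(entries):
    """Return (stale_free, stale) lists.

    An entry is treated as stale when another entry with the same
    <inode, fbn> has a higher consistency point counter value.
    """
    newest_cp = {}
    for e in entries:
        key = (e.inode, e.fbn)
        newest_cp[key] = max(newest_cp.get(key, e.cp_count), e.cp_count)

    stale_free, stale = [], []
    for e in entries:
        if e.cp_count < newest_cp[(e.inode, e.fbn)]:
            stale.append(e)
        else:
            stale_free.append(e)
    return stale_free, stale

entries = [
    Entry("aa", 7, 0, 10),  # overwritten later at the same <inode, fbn>
    Entry("bb", 7, 0, 12),
    Entry("cc", 8, 3, 11),
]
stale_free, stale = split_stale(entries)
```

The same structure would apply to the <vvbn> embodiment by keying on a virtual volume block number instead of the <inode, fbn> pair.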

FIG. 8A is a block diagram 800 of a fingerprints datastore (FPDS) 871A that is divided into a primary fingerprints datastore 871B and a secondary fingerprints datastore 873, according to certain embodiments. The entries in the FPDS 871A are sorted, for example, by fingerprint, to identify entries that have the same fingerprint values. These entries correspond to data blocks that are potentially duplicate data blocks. The entries pertaining to the potential duplicate data blocks are further analyzed to identify actual duplicate data blocks. The fingerprint entries that correspond to the actual duplicate data blocks are written to a secondary datastore (e.g., a secondary file) to create a secondary FPDS 873. The old FPDS 871A is overwritten with the fingerprint entries that correspond to the unique fingerprints to create the primary FPDS 871B. The entries in the primary FPDS 871B can be sorted by fingerprint.

FIG. 8B is a flow diagram of a method 850 for creating a primary and a secondary datastore, according to certain embodiments. The flow diagram corresponds to block diagram 800 in FIG. 8A. Method 850 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device), or a combination thereof. In one embodiment, method 850 is performed by a de-duplication module (e.g., de-duplication module 390 in FIG. 3) hosted by storage servers 110 of FIG. 1A.

At instruction block 801, the method sorts the entries in a fingerprints datastore (e.g., fingerprint.0 871A), for example, by fingerprint. Sorting the fingerprints datastore is optional but allows faster identification of duplicate data blocks, since the entries for any duplicate data block will reside adjacent to each other in the fingerprints datastore after the sort operation (e.g., by fingerprint value) is complete. One embodiment of a method for sorting fingerprints is described in greater detail in conjunction with FIG. 22.

At instruction block 803, the method determines whether there are any fingerprint entries having identical fingerprint values and identifies these entries as entries that correspond to potential duplicate data blocks. The blocks represented by any entries which have identical fingerprints are considered to be potential duplicate blocks, rather than actual duplicates, since there is always a possibility that two non-identical blocks could have the same fingerprint, regardless of the fingerprint scheme being used. If there are entries that have identical fingerprint values (instruction block 803), the method compares the data blocks corresponding to these entries to determine whether any of the data blocks are actual duplicate data blocks at instruction block 805. The method can perform a byte-by-byte comparison of the potentially duplicate data blocks to determine whether the data blocks are actually identical. In an alternative embodiment, instruction block 805 could be eliminated if an approximate verification of comparing fingerprints is deemed sufficient in determining that two blocks are identical.

If there are actual duplicate data blocks, the method writes the fingerprint entries that correspond to the actual duplicate data blocks to a second datastore (e.g., a secondary file) to create a secondary fingerprints datastore (e.g., secondary FPDS 873) at instruction block 809. At instruction block 811, the method overwrites the existing fingerprints datastore (e.g., fingerprint.0 871A) with the entries corresponding to the unique fingerprints to create a primary fingerprints datastore (e.g., primary FPDS 871B). In one embodiment, the method sorts the entries in the primary fingerprints datastore 871B, for example, by fingerprint, at instruction block 813.
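As a minimal sketch of the split described for method 850, the following example sorts entries by fingerprint and separates them into a primary list (one entry per unique fingerprint) and a secondary list (the remaining entries sharing a fingerprint with a primary entry). It assumes the alternative embodiment in which the byte-by-byte comparison of instruction block 805 is omitted; the tuple layout and the function name `split_primary_secondary` are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def split_primary_secondary(entries):
    """entries: iterable of (fingerprint, inode, fbn) tuples.

    Returns (primary, secondary), with primary sorted by fingerprint.
    Assumes a fingerprint match is deemed sufficient (no byte comparison).
    """
    ordered = sorted(entries, key=itemgetter(0))   # cf. instruction block 801
    primary, secondary = [], []
    for _, group in groupby(ordered, key=itemgetter(0)):
        first, *rest = group
        primary.append(first)    # single entry per unique fingerprint
        secondary.extend(rest)   # entries repeating that fingerprint
    return primary, secondary

fpds = [("f2", 1, 0), ("f1", 1, 1), ("f2", 2, 5), ("f1", 3, 0), ("f3", 4, 4)]
primary, secondary = split_primary_secondary(fpds)
```

Because `sorted` is stable, the entry kept in the primary list for each fingerprint is the one that appeared first in the input.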

FIG. 9A is a block diagram 900 for de-duplication (e.g., identifying and removing fingerprint entries corresponding to duplicate data blocks) using a primary fingerprints datastore and a secondary fingerprints datastore, according to certain embodiments. One embodiment creates and updates a sorted primary FPDS 971A and a sorted secondary FPDS 975. A primary FPDS 971A contains an entry (e.g., fingerprint record) for each unique fingerprint value. A secondary FPDS 975 contains fingerprint entries that have the same fingerprint value as an entry (e.g., record) in the primary FPDS 971A. In one embodiment, the secondary FPDS 975 can be a segmented datastore, as described in conjunction with FIG. 11A, and maintained in a sorted order (e.g., by <inode,fbn>). In one embodiment, an entry can include, and is not limited to, the fingerprint of the block, the inode number of the file to which the block belongs, and the FBN (file block number) of the block. In certain embodiments, each fingerprint is a checksum, such as an MD5 checksum.
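For illustration, an entry of the kind described above could be formed as in the following sketch, which computes an MD5 checksum of a block with Python's standard `hashlib` module. The helper name `make_entry` and the tuple layout are assumptions for this sketch only.

```python
import hashlib

def make_entry(block: bytes, inode: int, fbn: int):
    """Build a (fingerprint, inode, fbn) entry using an MD5 checksum."""
    fingerprint = hashlib.md5(block).hexdigest()
    return (fingerprint, inode, fbn)

# Two blocks with identical contents yield the same fingerprint,
# which marks them as potential duplicates.
a = make_entry(b"\x00" * 4096, inode=96, fbn=0)
b = make_entry(b"\x00" * 4096, inode=97, fbn=12)
```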

When new data blocks are written or updated in a file system, a storage server creates and logs new fingerprint entries into a changelog file 973A. In one embodiment, the changelog 973A contains information of the same type as the fingerprints datastore (e.g., primary FPDS 971A and secondary FPDS 975) (i.e., fingerprint of the block, inode number of the file to which the block belongs, and the FBN of the block), but only for new or modified blocks. The changelog 973A is sorted when a de-duplication start request (e.g., start sis command) is detected.

The sorted changelog 973B is merged in-memory with the sorted primary FPDS 971A to identify potential duplicate data blocks. Unlike the fingerprints datastore of conventional de-duplication solutions, the sorted primary FPDS 971A is significantly smaller because it stores only entries having unique fingerprint values. Therefore, an in-memory merge of the sorted changelog 973B with a smaller FPDS (e.g., primary FPDS 971A), at reference 908, reduces the overall time taken to identify potential duplicate blocks. An in-memory merge of data refers to merging data temporarily in memory. An on-disk merge of data refers to writing merged data to a storage device (e.g., disk).

The potential duplicate data blocks are further analyzed to identify the actual duplicate data blocks. The fingerprint entries that correspond to the eliminated actual duplicate data blocks are written to a third datastore (e.g., file) 977. The primary FPDS 971A is overwritten with the fingerprint entries that correspond to the unique fingerprints to create an updated primary FPDS 971B. Subsequently, during a verify operation to identify and remove the fingerprint entries that correspond to the eliminated actual duplicate data blocks (stale fingerprint entries), the third datastore 977 is merged with the sorted secondary FPDS 975, which is then merged with the updated primary FPDS 971B to identify and remove stale fingerprint entries. One embodiment of a method for identifying and removing fingerprint entries, which correspond to duplicate data blocks, using a primary fingerprints datastore and a secondary fingerprints datastore is described in greater detail in conjunction with FIGS. 17A-17B.

FIG. 9B is a flow diagram of a method 950 for de-duplication (e.g., identifying and removing fingerprint entries corresponding to duplicate data blocks) using a primary fingerprints datastore and a secondary fingerprints datastore, according to certain embodiments. The flow diagram corresponds to block diagram 900 in FIG. 9A. Method 950 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device), or a combination thereof. In one embodiment, method 950 is performed by a de-duplication module (e.g., de-duplication module 390 in FIG. 3) hosted by storage servers 110 of FIG. 1A.

According to one embodiment, a fingerprints datastore is divided into multiple datastores, such as a primary FPDS and a secondary FPDS. A primary datastore stores a single entry (e.g., record) for each fingerprint value, and thus is smaller than a conventional fingerprints datastore. A secondary fingerprints datastore stores the remaining entries, such as the entries that have the same fingerprint value as an entry in the primary FPDS. In one embodiment, the secondary FPDS 975 can be a segmented datastore, as described in conjunction with FIG. 11A, and maintained in a sorted order (e.g., by <inode,fbn>). At instruction block 901, the method maintains a primary FPDS (e.g., primary FPDS 971A) and a secondary FPDS (e.g., secondary FPDS 975) and sorts the primary FPDS 971A and the secondary FPDS 975 at instruction block 903.

At instruction block 905, the method detects a de-duplication operation start request (e.g., sis start command) and sorts a changelog (e.g., changelog 973A) by fingerprint at instruction block 907. At instruction block 909, the method merges the sorted changelog 973B with the sorted primary FPDS 971A to identify potential duplicate data blocks at instruction block 911. As noted above, unlike the fingerprints datastore of conventional de-duplication solutions, the sorted primary FPDS 971A is significantly smaller because it stores only entries having unique fingerprint values. Therefore, merging the sorted changelog 973B with a smaller FPDS (e.g., primary FPDS 971A) reduces the overall time taken to identify potential duplicate blocks.
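The merge of a sorted changelog with a sorted primary FPDS can be illustrated by a single linear pass over both lists, as in the sketch below. The tuple layout and the function name `find_potential_duplicates` are hypothetical; the sketch assumes that the primary list holds at most one entry per fingerprint and reports only matches between the changelog and the primary list.

```python
def find_potential_duplicates(primary, changelog):
    """Merge-join two lists of (fingerprint, inode, fbn) tuples.

    Both lists must already be sorted by fingerprint. Returns a list of
    (primary_entry, changelog_entry) pairs with matching fingerprints.
    """
    matches = []
    i = j = 0
    while i < len(primary) and j < len(changelog):
        pf, cf = primary[i][0], changelog[j][0]
        if pf < cf:
            i += 1
        elif pf > cf:
            j += 1
        else:
            matches.append((primary[i], changelog[j]))
            j += 1  # keep i: later changelog entries may share this fingerprint
    return matches

primary = [("a", 1, 0), ("c", 1, 1), ("e", 2, 0)]
changelog = sorted([("c", 5, 1), ("b", 5, 0), ("e", 7, 2), ("c", 6, 0)])
pairs = find_potential_duplicates(primary, changelog)
```

Each input is visited once, so the cost of the pass grows with the combined number of entries, which is why a smaller primary list shortens the merge.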

If there are entries that have identical fingerprint values (instruction block 911), the method performs a byte-by-byte comparison of the data blocks that correspond to the entries that have identical fingerprint values to identify data blocks that are actual duplicate data blocks at instruction block 913. If there are any data blocks that are actual duplicate data blocks (instruction block 915), the method eliminates the actual duplicate data blocks at instruction block 917. Eliminating the duplicate data blocks can include sharing the remaining instance of each data block that was duplicated and freeing the (no longer used) duplicate data block(s). The method frees the duplicate block or blocks so that only one instance remains of each unique block, and shares the remaining instance of the block to the extent possible. The method then updates a reference count file and an active map at instruction block 919 to reflect the newly shared and freed blocks. One embodiment of a method for eliminating a data block, such as a duplicate block, is described in greater detail in conjunction with FIG. 23.
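The distinction between potential and actual duplicates can be illustrated with the following sketch, which keeps only those candidate pairs whose block contents are identical. The `read_block` callable stands in for whatever mechanism reads a block from storage and is purely hypothetical, as is the function name `confirm_duplicates`.

```python
def confirm_duplicates(candidates, read_block):
    """Filter candidate pairs down to actual duplicates.

    candidates: list of (kept_entry, other_entry) pairs whose
                fingerprints match.
    read_block: hypothetical callable mapping an entry to its bytes.
    Comparing bytes objects with == compares their full contents.
    """
    return [(kept, other) for kept, other in candidates
            if read_block(kept) == read_block(other)]

# Illustrative in-memory "storage": the third block collides on the
# fingerprint but differs in content, so it is not an actual duplicate.
blocks = {
    ("fp", 1, 0): b"hello",
    ("fp", 2, 0): b"hello",
    ("fp", 3, 0): b"hellp",
}
candidates = [(("fp", 1, 0), ("fp", 2, 0)), (("fp", 1, 0), ("fp", 3, 0))]
actual = confirm_duplicates(candidates, blocks.__getitem__)
```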

At instruction block 921, the method writes the fingerprint entries that correspond to the eliminated actual duplicate data blocks to a third datastore 977 (e.g., file) and overwrites the primary FPDS (e.g., primary FPDS 971A) with the fingerprint entries that correspond to the unique fingerprints to create an updated primary FPDS (e.g., primary FPDS 971B) at instruction block 923. In one embodiment, the method determines whether the entire primary FPDS 971A has been examined. If the entire primary FPDS 971A has not been examined, the method returns to instruction block 911 to identify entries that have identical fingerprints, until the primary FPDS 971A has been examined.

FIG. 10 illustrates the elements of a de-duplication module 1000 (e.g., de-duplication module 390 in FIG. 3) coupled to a segmented fingerprints datastore, according to certain embodiments. The elements include a de-duplication engine 1010, a gatherer module 1020, a fingerprint manager 1030, a fingerprint handler 1040, a block sharing engine 1050, and a stale fingerprint manager 1080. The de-duplication module 1000 can be coupled to a fingerprints datastore that stores the fingerprints of all data blocks that have been written to storage (e.g., storage 170A,B in FIG. 1A).

When de-duplication runs for the first time on a flexible volume with existing data, the de-duplication module 1000 scans the blocks in the flexible volume and creates a fingerprints datastore (FPDS), which contains a sorted list of all fingerprints for used blocks in the flexible volume, according to some embodiments. The FPDS can store an entry (e.g., fingerprint record) for each data block that is written to the storage.

Conventional de-duplication solutions include a single, significantly large, FPDS that has a flat file structure. With a flat file structure, the FPDS is overwritten with every de-duplication operation. During every de-duplication operation, the entries in a changelog are merged with the entries in the FPDS, and the old FPDS is overwritten with the merged entries. Therefore, traditional implementations incur a significant cost in overwriting the entire FPDS with every de-duplication operation, irrespective of the size of the changelog. According to certain embodiments, a FPDS is improved by organizing the FPDS as a master datastore and datastore segments to avoid overwriting an entire FPDS with every de-duplication operation.

In one embodiment, the fingerprint manager 1030 organizes a FPDS as multiple segments, such as a master fingerprints datastore 1060 and datastore segments 1 to n (1065-1 to 1065-n). A master FPDS 1060 stores an entry (e.g., fingerprint record) for each data block that is written to the storage (e.g., storage 170A,B in FIG. 1A). In one embodiment, an entry can include, and is not limited to, the fingerprint of the block, the inode number of the file to which the block belongs, and the FBN (file block number) of the block. A datastore segment contains information of the same type as the fingerprints datastore (e.g., master FPDS 1060) (i.e., fingerprint of the block, inode number of the file to which the block belongs, and the FBN of the block), but only for new and modified data blocks written to the storage (e.g., storage 170A,B in FIG. 1A). Unlike conventional de-duplication solutions, the master fingerprints datastore 1060 is not overwritten with every de-duplication operation, but the overwriting is delayed until a verify operation is performed or until a threshold for the number of FPDS segments is reached.

The de-duplication engine 1010 schedules and triggers operations of the other modules. In particular, the de-duplication engine 1010 triggers operation of the gatherer module 1020, which may be done according to a pre-specified schedule, timing algorithm, or in response to a manual input. The de-duplication engine 1010 detects a request to start a de-duplication operation (e.g., sis start command) and starts a de-duplication operation. For example, the de-duplication engine 1010 invokes the gatherer module 1020. In one embodiment where a verify operation is currently in progress, the de-duplication engine 1010 detects a de-duplication start request and notifies the stale fingerprint manager 1080. One embodiment for invoking a de-duplication operation and performing a verify operation in the background is described in detail in conjunction with FIGS. 19-20.

When de-duplication runs for the first time, the gatherer module 1020 identifies each data block that has been written to storage (e.g., storage 170A,B in FIG. 1A) and triggers the fingerprint handler 1040 to compute fingerprints for the data blocks and return them to the gatherer module 1020. The fingerprint manager 1030 receives the fingerprints from the gatherer module 1020 and stores an entry (e.g., a fingerprint record) for each data block that is written to the storage in a FPDS (e.g., master FPDS 1060). The fingerprint manager 1030 creates an entry (e.g., fingerprint record) for each data block that has been written to storage. The fingerprint manager 1030 sorts the entries in the FPDS (e.g., master FPDS 1060), for example, by fingerprint value. One embodiment of a sorting process is described in greater detail in conjunction with FIG. 22.

The gatherer module 1020 also identifies new data blocks that are written or updated in storage (e.g., storage 170A,B in FIG. 1A) and triggers the fingerprint handler 1040 to compute fingerprints of the new data blocks and return them to the gatherer module 1020. When new data blocks are written or updated in a file system, the fingerprint manager 1030 creates and logs an entry (e.g., fingerprint record) into the changelog 1070. In one embodiment, the changelog 1070 contains information of the same type as the fingerprints datastore (e.g., master FPDS 1060) (i.e., fingerprint of the block, inode number of the file to which the block belongs, and the FBN of the block), but only for new or modified blocks. The fingerprint manager 1030 sorts the entries in the changelog 1070, for example, by fingerprint value.

The fingerprint manager 1030 identifies entries in a sorted FPDS that have matching fingerprints to identify potentially duplicate data blocks and eliminate the duplicate data blocks. Eliminating the duplicate data blocks includes sharing the remaining instance of each data block that was duplicated and freeing the (no longer used) duplicate data block(s). For example, the fingerprint manager 1030 detects a de-duplication start request (e.g., sis start command) and sorts the entries in the FPDS to identify potentially duplicate data blocks and eliminate the duplicate data blocks. The fingerprint manager 1030 also detects subsequent de-duplication start requests (e.g., sis start commands) and re-executes a sorting process on the entries in a FPDS (e.g., master FPDS 1060) and the changelog 1070 by fingerprint value to identify potential duplicate data blocks.

In some embodiments, the fingerprint manager 1030 first determines whether the FPDS (e.g., master FPDS 1060) meets a threshold (e.g., number of fingerprint entries) and, if the FPDS meets the threshold, the fingerprint manager 1030 writes the fingerprint entries in the sorted changelog 1070 to a new datastore segment, hereinafter referred to as a ‘fingerprints datastore (FPDS) segment’, ‘fingerprints segment’, or ‘segment’ (e.g., segment 1065-1). The fingerprint manager 1030 sorts the entries in a segment, for example, by fingerprint value.

The fingerprint manager 1030 merges the entries in a sorted FPDS (e.g., master FPDS 1060) with entries in the existing sorted segment (e.g., segment 1065-1) to identify potentially duplicate data blocks, by finding entries with matching fingerprints. Typically, a fingerprints datastore is overwritten with the merged data with each de-duplication operation. Unlike conventional de-duplication solutions, the master FPDS 1060 is not overwritten with every de-duplication operation, but the overwriting is delayed until a verify operation is performed or until a threshold for the number of FPDS segments is reached.

With each subsequent de-duplication operation, the fingerprint manager 1030 writes the entries in a changelog file to a new FPDS segment (e.g., segment 1065-n), until a segment count threshold (a threshold for the number of FPDS segments) is reached or a verify operation is to be performed by the stale fingerprint manager 1080.

When the segment count threshold is reached or when the stale fingerprint manager 1080 is triggered to perform a verify operation, the fingerprint manager 1030 merges the fingerprint entries in the sorted changelog 1070, the entries in all of the FPDS segments (e.g., 1065-1 to 1065-n), and the entries in the master FPDS 1060 and overwrites the old master FPDS 1060 with the merged data to create a new master FPDS in aggregate. The fingerprint manager 1030 can use the data in the new master FPDS for a verify operation to identify and remove stale fingerprint entries that correspond to eliminated data blocks and/or for a subsequent de-duplication operation.

The block sharing engine 1050 compares potentially duplicate data blocks identified by the fingerprint manager 1030 to each other to identify actual duplicate data blocks. The block sharing engine 1050 can also eliminate the actual duplicate data blocks and implement block sharing by calling functions of a file system (e.g., file system 310 in FIG. 3). One embodiment of a method for eliminating a data block, such as a duplicate block, is described in greater detail in conjunction with FIG. 23.

The stale fingerprint manager 1080 performs a verify operation to identify and remove ‘stale’ fingerprint entries from a fingerprints datastore (e.g., a new master FPDS in aggregate), according to one embodiment. In another embodiment, the stale fingerprint manager 1080 removes stale fingerprints during a subsequent de-duplication operation. A stale entry (e.g., stale fingerprint record) is an entry that has a fingerprint that corresponds to a data block that has been eliminated (deleted or overwritten) by the block sharing engine 1050. The stale fingerprint manager 1080 detects a request to perform a verify operation. For example, the stale fingerprint manager 1080 detects that a verify operation is triggered when a number of stale entries in a FPDS reaches or exceeds a stale entries threshold. In another example, a verify operation is triggered from a CLI. In another example, a verify operation is user-driven, for example, by the de-duplication module receiving instructions entered by a user via a command line interface.

The fingerprint manager 1030 saves context information in the fingerprint entries (e.g., entries in the master FPDS 1060, all segments 1065-1 to 1065-n, and any changelogs 1070) for each block, such as the value of a consistency point counter at the time the block was written to a storage device (e.g., disk). The stale fingerprint manager 1080 can use the context information to detect and delete stale fingerprint entries from the fingerprints datastore (e.g., a new master FPDS in aggregate). Entries having higher consistency point counter values are more recent than entries with lower consistency point counter values. The stale fingerprint manager 1080 identifies fingerprint entries having the same fingerprint values as other entries, but with lower consistency point counter values compared to the other entries, as stale fingerprint entries. The entries that are not identified as stale are stale-free fingerprint entries. Embodiments of a method for identifying and removing stale fingerprint entries are described in greater detail in conjunction with FIGS. 16-20.

FIG. 11A is a block diagram 1100 for de-duplication (e.g., identifying and removing fingerprint records corresponding to duplicate data blocks) using a segmented fingerprints datastore, according to certain embodiments. One embodiment maintains a sorted master FPDS (master datastore) 1171A-0 (e.g., segment.0). A master FPDS 1171A-0 contains an entry (e.g., a fingerprint record) for each data block that is written to a storage device (e.g., file system). When new data blocks are written or updated in a file system, a storage server creates and logs new entries (e.g., fingerprint records) into a changelog file 1181A. The changelog 1181A is sorted when a de-duplication start request (e.g., start sis command) is detected. When the master FPDS 1171A-0 meets a master datastore threshold (e.g., a threshold for a number of fingerprint entries in the master datastore), the fingerprint entries in the sorted changelog 1181B are written to a new datastore segment (e.g., segment 1171A-1). The original master FPDS 1171A-0 remains sorted and is maintained as is. In every de-duplication operation, an in-memory merge of all of the segments (including the master FPDS, e.g., segment.0) is performed to identify potential duplicate blocks. An in-memory merge of data refers to merging data temporarily in memory. An on-disk merge of data refers to writing merged data to a storage device (e.g., disk). For example, there is one segment, segment.1 1171A-1, and the entries in segment.1 1171A-1 are merged in-memory with the entries in the sorted master FPDS 1171A-0 to identify and eliminate duplicate data blocks. Unlike conventional de-duplication solutions which overwrite an entire fingerprints datastore with each de-duplication operation, the original master FPDS 1171A-0 remains sorted and is maintained as is, as seen at reference 1108A. The original master FPDS 1171A-0 is not overwritten with every de-duplication operation because the de-duplication operation performs an in-memory merge to identify potential duplicate blocks. Thus, de-duplication is improved to reduce the write cost by delaying the overwriting (on-disk merge) of the master FPDS 1171A-0 until a verify operation is to be performed or until a threshold for the number of FPDS segments is reached.
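An in-memory merge of a sorted master FPDS with sorted segments can be sketched with Python's standard `heapq.merge`, which lazily combines already-sorted inputs without writing anything back. The tuple layout and the function name `merge_segments_in_memory` are assumptions made for this illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_segments_in_memory(master, segments):
    """Yield (fingerprint, entries) for fingerprints seen more than once.

    master and each segment are lists of (fingerprint, inode, fbn)
    tuples, each already sorted by fingerprint. The inputs are read
    only; the master list is never overwritten.
    """
    merged = heapq.merge(master, *segments, key=itemgetter(0))
    for fingerprint, group in groupby(merged, key=itemgetter(0)):
        entries = list(group)
        if len(entries) > 1:
            yield fingerprint, entries  # potential duplicate blocks

master = [("a", 1, 0), ("c", 1, 1)]
segments = [
    [("c", 5, 0), ("d", 5, 1)],   # e.g., a first segment
    [("a", 9, 9), ("d", 6, 0)],   # e.g., a second segment
]
potential = dict(merge_segments_in_memory(master, segments))
```

Deferring the write in this way trades a somewhat wider merge at each de-duplication operation for avoiding a full rewrite of the master datastore every time.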

With each subsequent de-duplication operation, the entries in a changelog file 1181C are written to a FPDS segment, until a segment count threshold is reached or a verify operation is to be performed. For example, after n de-duplication operations, there can be n FPDS segments (1171A-1 to 1171A-n). During a subsequent de-duplication operation (e.g., de-dupe operation # n), a changelog 1181C is sorted when a de-duplication start request (e.g., start sis command) is detected. When the threshold for the number of FPDS segments is reached, the fingerprint entries in the sorted changelog 1181D, the entries in all of the FPDS segments (e.g., 1171A-1 to 1171A-n), and the entries in the master FPDS 1171A-0 are merged on-disk, thus overwriting the old master FPDS 1171A-0 with the on-disk merged data to create a master FPDS 1171B-0 in aggregate. The master FPDS 1171B-0 in aggregate can be used to identify and eliminate duplicate data blocks.

FIG. 11B is a flow diagram of a method 1150 for de-duplication (e.g., identifying and removing fingerprint entries corresponding to duplicate data blocks) using a segmented fingerprints datastore (FPDS), according to certain embodiments. The flow diagram corresponds to block diagram 1100 in FIG. 11A. Method 1150 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device), or a combination thereof. In one embodiment, method 1150 is performed by a de-duplication module (e.g., de-duplication module 390 in FIG. 3) hosted by storage servers 110 of FIG. 1A.

At instruction block 1101, according to one embodiment, a FPDS is organized and maintained as a segmented datastore, such as a master FPDS 1171A-0 and FPDS segments 1171A-1 to 1171A-n. A master FPDS 1171A-0 stores an entry (e.g., a fingerprint record) for each data block that is written to a file system. A datastore segment stores fingerprint entries for new and modified data blocks written to storage (e.g., storage 170A,B in FIG. 1A).

At instruction block 1103, when new data blocks are written or updated in a file system, the method creates and logs new fingerprint entries into a changelog file 1181A. At instruction block 1105, the method determines whether there is a de-duplication start request. For example, the method detects a sis start command. If there is not a de-duplication start request (instruction block 1105), the method determines whether there is a request for a verify operation to be performed at instruction block 1123. If there is not a verify operation to be performed (instruction block 1123), the method returns to instruction block 1101 to maintain a sorted master datastore.

If a de-duplication start request is detected (instruction block 1105), the method sorts the entries in the changelog 1181A, for example, by fingerprint value at instruction block 1107. At instruction block 1109, the method determines whether the current number of entries in the master datastore meets a master datastore threshold. In one embodiment, the method overwrites the master FPDS 1171A-0 with every de-duplication operation until a master datastore threshold is reached. A master datastore threshold can be a user-defined threshold and can be stored as a parameter, for example in a data store (e.g., data store 1075 in FIG. 10). For example, the master datastore threshold is a comparator (e.g., less than, less than or equal to, greater than, greater than or equal to, etc.) and a number of entries.

At instruction block 1109, if the master datastore threshold is reached (e.g., a threshold for a number of fingerprint entries in the master datastore), the method determines whether the current number of segments meets a segment count threshold at instruction block 1111. The segment count threshold can be a comparator (e.g., less than, less than or equal to, greater than, greater than or equal to, etc.) and a number of segments. The segment count threshold can be a user-defined threshold and can be stored as a parameter, for example in a data store (e.g., data store 1075 in FIG. 10). For example, the segment count threshold is less than or equal to a value of 40.

If the segment count threshold (e.g., 40 segments) has been met (instruction block 1111), the method performs an on-disk merge of the fingerprint entries in the sorted changelog 1181D, the entries in all of the existing FPDS segments (e.g., 1171A-1 to 1171A-n), and the entries in the master FPDS 1171A-0, and overwrites the old master FPDS 1171A-0 with the on-disk merged data to create a master FPDS 1171B-0 in aggregate at instruction block 1119. The master FPDS 1171B-0 in aggregate can be used to identify and eliminate duplicate data blocks at instruction block 1121.

If the segment count threshold has not been reached (instruction block 1111), the method writes the fingerprint entries in a sorted changelog to a new datastore segment, referred to as a FPDS segment or segment, at instruction block 1113. The original master FPDS remains sorted and is maintained as is. In one embodiment, where changelogs do not meet a changelog threshold, the method appends the fingerprint entries in a sorted changelog to the last FPDS segment, if the last segment size is less than a segment size threshold. A changelog threshold can be a user-defined threshold and can be stored as a parameter, for example in a data store (e.g., data store 1075 in FIG. 10). A segment can have a maximum segment size (e.g., 8 GB, 16 GB), also referred to as a ‘segment size threshold’. Examples of a segment size threshold include, and are not limited to, a size of a file, a number of fingerprints, etc. The segment size threshold can be a user-defined threshold and can be stored as a parameter, for example in a data store (e.g., data store 1075 in FIG. 10). The segment size threshold can be determined from a change rate (e.g., 2% change rate) and a size of a volume (e.g., 16 TB volume size). For example, the segment size threshold is 8 GB.
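The threshold checks of instruction blocks 1109 through 1113 can be summarized, purely as an illustrative sketch, by a small decision function. The function name `next_action`, its parameter names, the returned labels, and the specific comparators (a threshold is treated as "met" when the count is greater than or equal to it) are all assumptions; the embodiments leave the comparator and values user-definable, and sizes here are expressed as entry counts for simplicity.

```python
def next_action(master_entries, segment_sizes, changelog_entries, *,
                master_threshold, segment_count_threshold,
                changelog_threshold, segment_size_threshold):
    """Pick what to do with a sorted changelog at de-duplication start.

    master_entries:    number of entries in the master datastore
    segment_sizes:     sizes (in entries) of the existing segments
    changelog_entries: number of entries in the sorted changelog
    """
    if master_entries < master_threshold:
        # Master datastore threshold not yet reached: overwrite master.
        return "overwrite_master"
    if len(segment_sizes) >= segment_count_threshold:
        # Segment count threshold met: on-disk merge of everything.
        return "on_disk_merge_all"
    if (segment_sizes
            and changelog_entries < changelog_threshold
            and segment_sizes[-1] < segment_size_threshold):
        # Small changelog and room in the last segment: append to it.
        return "append_to_last_segment"
    return "new_segment"

limits = dict(master_threshold=1000, segment_count_threshold=40,
              changelog_threshold=50, segment_size_threshold=500)
action = next_action(2000, [100], 5, **limits)
```

With the hypothetical limits shown, a small changelog arriving while the last segment still has room is appended rather than starting a new segment.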

At instruction block 1115, the method sorts the entries in the segment 1171A-1 by fingerprint value and performs an in-memory merge of the entries in the segment 1171A-1 with the entries in the sorted master FPDS 1171A-0 at instruction block 1117 to identify and eliminate duplicate data blocks at instruction block 1121.

At instruction block 1123, the method determines whether there is a request for a verify operation to be performed. If there is not a verify operation to be performed, the method returns to instruction block 1101 to maintain a sorted master datastore. A verify operation (identification and removal of stale fingerprint entries) can be automatically triggered when a number of stale entries in a FPDS reaches or exceeds a stale entries threshold. In another example, a verify operation is triggered from a CLI. In another example, a verify operation is user-driven, for example, by the method detecting instructions entered by a user via a command line interface. If there is a request for a verify operation (instruction block 1123), the method determines whether all of the segments, the changelog, and the old master FPDS 1171A-0 have already been merged on-disk and the old master FPDS 1171A-0 has been overwritten with the on-disk merged data to create a master FPDS 1171B-0 in aggregate at instruction block 1125.

If a master FPDS in aggregate 1171B-0 has not been created (instruction block 1125), the method performs an on-disk merge of the fingerprint entries in the sorted changelog 1181D, the entries in all of the existing FPDS segments (e.g., 1171A-1 to 1171A-n), and the entries in the master FPDS 1171A-0, and overwrites the old master FPDS 1171A-0 with the on-disk merged data to create a master FPDS 1171B-0 in aggregate at instruction block 1125. The master FPDS 1171B-0 in aggregate can be used for the verify operation (to identify and eliminate stale fingerprint entries) at instruction block 1127. Embodiments of methods for identifying and removing stale fingerprint entries are described in greater detail in conjunction with FIGS. 16-20.

FIG. 12 illustrates the elements of a de-duplication module 1200 (e.g., de-duplication module 390 in FIG. 3) for addressing a data block in a volume using a virtual volume block number (VVBN), according to certain embodiments. The elements include a de-duplication engine 1210, a gatherer module 1220, a fingerprint manager 1230, a fingerprint handler 1240, a block sharing engine 1250, and a stale fingerprint manager 1280. The de-duplication module 1200 can be coupled to a fingerprints datastore that stores the fingerprints of all data blocks that have been written to storage (e.g., storage 170A,B in FIG. 1A).

Conventional de-duplication solutions refer to a data block in a volume using fingerprint entries that contain logical data, such as an inode and file block number (e.g., <inode, fbn>). Since these traditional solutions refer to each block logically, a fingerprints datastore (FPDS) needs to store each reference to one physical block. Storing fingerprint entries for each logical block adds overhead to a de-duplication operation and does not allow a FPDS to scale easily with increases in shared data in a volume.

According to certain embodiments, a FPDS 1260 is improved by referencing a data block in a volume uniquely using a virtual volume block number (VVBN) instead of using <inode, fbn>. By using a VVBN to refer to a data block, a FPDS can more easily scale with increased block sharing. FIG. 13 illustrates a block diagram of mapping 1300 a data block from a file 1301 to a storage device 1307B (e.g., disks), according to certain embodiments. Mapping 1300 illustrates a single block 1309 as part of several logical and physical storage containers: a file 1301, a container file 1303 holding a flexible volume, an aggregate 1305, and a storage device 1307B. Each provides an array of blocks indexed by the appropriate type of block number. The file 1301 is indexed by file block number (FBN) 1351, the container file 1303 by virtual volume block number (VVBN) 1353, and the aggregate 1305 by physical volume block number (PVBN) 1355. The storage devices (e.g., disks) 1307 are indexed by disk block number (DBN) 1357.

To translate an FBN 1351 to a disk block, a file system, such as WAFL, goes through several steps. At reference 1310, the file system uses the file's 1301 inode and buffer tree to translate the FBN 1351 to a VVBN 1353. At reference 1320, the file system translates the VVBN 1353 to a PVBN 1355 using the container file's 1303 inode and buffer tree. At reference 1330, RAID translates the PVBN 1355 to a DBN 1357. At reference 1340, the file system can use an alternative shortened method provided by dual VBNs to bypass the container map's VVBN-to-PVBN translation. A file system can store PVBNs 1355 in the file's buffer tree to bypass the container map's VVBN-to-PVBN translation.
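The translation chain can be sketched with plain lookup tables. This is an illustration only: the class `BlockMaps`, its dictionary fields, and both method names are hypothetical stand-ins for the inode and buffer tree structures and the RAID layer, not an actual file system API.

```python
from dataclasses import dataclass, field


@dataclass
class BlockMaps:
    """Hypothetical lookup tables standing in for the real structures."""
    file_tree: dict = field(default_factory=dict)      # FBN  -> VVBN (file's buffer tree)
    container_map: dict = field(default_factory=dict)  # VVBN -> PVBN (container file's buffer tree)
    raid_map: dict = field(default_factory=dict)       # PVBN -> DBN  (RAID layer)
    dual_vbn: dict = field(default_factory=dict)       # FBN  -> PVBN (stored alongside the VVBN)

    def fbn_to_dbn(self, fbn):
        """Full path: FBN -> VVBN -> PVBN -> DBN."""
        vvbn = self.file_tree[fbn]       # reference 1310
        pvbn = self.container_map[vvbn]  # reference 1320
        return self.raid_map[pvbn]       # reference 1330

    def fbn_to_dbn_fast(self, fbn):
        """Shortened path: the stored PVBN skips the container map lookup."""
        pvbn = self.dual_vbn[fbn]        # reference 1340
        return self.raid_map[pvbn]


maps = BlockMaps(
    file_tree={0: 100},
    container_map={100: 5000},
    raid_map={5000: 42},
    dual_vbn={0: 5000},
)
```

Both paths resolve the same disk block; the shortened path simply performs one lookup fewer.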

Returning to FIG. 12, when de-duplication runs for the first time on a flexible volume with existing data, the de-duplication module 1200 scans the blocks in the flexible volume and creates a FPDS 1260, which contains a sorted list of all fingerprints for used blocks in the flexible volume, according to some embodiments. In one embodiment, the FPDS 1260 stores an entry (e.g., fingerprint record) for each data block that is written to the storage (e.g., storage 170A,B in FIG. 1A). One embodiment for dividing a FPDS into multiple parts, such as a primary datastore and a secondary datastore, is described in greater detail in conjunction with FIGS. 9A-9B. Another embodiment for organizing a FPDS into segments is described in greater detail in conjunction with FIGS. 11A-11B.

The de-duplication engine 1210 schedules and triggers operations of the other modules. In particular, the de-duplication engine 1210 triggers operation of the gatherer module 1220, which may be done according to a pre-specified schedule, timing algorithm, or in response to a manual input. The de-duplication engine 1210 detects a request to start a de-duplication operation (e.g., sis start command) and starts a de-duplication operation. For example, the de-duplication engine 1210 invokes the gatherer module 1220. In one embodiment where a verify operation is currently in progress, the de-duplication engine 1210 detects a de-duplication start request and notifies the stale fingerprint manager 1280. One embodiment for invoking a de-duplication operation and performing a verify operation in the background is described in detail in conjunction with FIGS. 19-20.

The gatherer module 1220 identifies each data block that has been written to storage (e.g., storage 170A,B in FIG. 1A) and triggers the fingerprint handler 1240 to compute fingerprints (e.g., a checksum) for the data blocks and return them to the gatherer module 1220. The fingerprint manager 1230 receives the fingerprints of the data blocks that have been written to storage from the gatherer module 1220 and creates and stores an entry (e.g., fingerprint record) for each data block that is written to the storage (e.g., storage 170A,B in FIG. 1A) in a FPDS 1260.

Typically, an entry in a FPDS can include, and is not limited to, the fingerprint of the block (‘fp’), context information, such as the value of a consistency point counter (e.g., a generation time stamp (‘cp-cnt’)) at the time the block was written to a storage device (e.g., disk), and logical data, such as the inode number of the file to which the block belongs (‘inode’) and the FBN (file block number) of the block (‘fbn’). According to certain embodiments, an entry (e.g., fingerprint record) is improved by including, and is not limited to, the fingerprint of the block (‘fp’), context information, such as the value of a consistency point counter (e.g., a generation time stamp (‘cp-cnt’)) at the time the block was written to a storage device (e.g., disk), and physical data, such as a container file identifier (‘Container-FileID’) and the VVBN (virtual volume block number) of the block (‘vvbn’). The FPDS 1260 reduces to a map file that can be indexed by VVBN, according to certain embodiments. Instead of using an inode and FBN 1351 to refer to a block 1309 in a volume, the de-duplication module 1200 can use VVBN 1353 to refer to a block 1309.
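The two entry layouts can be compared in a short sketch. The class names `LogicalEntry` and `PhysicalEntry`, the field types, and the sample values are assumptions made for illustration; the text does not specify a binary layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalEntry:
    """Conventional entry: one record per logical reference to a block."""
    fp: int      # fingerprint of the block
    cp_cnt: int  # consistency point counter when the block was written
    inode: int   # file to which the block belongs
    fbn: int     # file block number within that file


@dataclass(frozen=True)
class PhysicalEntry:
    """VVBN-based entry: one record per block in the volume."""
    fp: int
    cp_cnt: int
    container_file_id: int
    vvbn: int


# One shared block referenced by three files needs three logical entries...
logical = [LogicalEntry(fp=0xABC, cp_cnt=7, inode=i, fbn=0) for i in (11, 12, 13)]

# ...but only one physical entry, so the FPDS can be indexed by VVBN.
physical = [PhysicalEntry(fp=0xABC, cp_cnt=7, container_file_id=1, vvbn=900)]
fpds_by_vvbn = {entry.vvbn: entry for entry in physical}
```

The example shows why the number of physical entries does not grow as more files share the same block.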

The gatherer module 1220 also identifies new data blocks that are written or updated in storage (e.g., storage 170A,B in FIG. 1A) and triggers the fingerprint handler 1240 to compute fingerprints of the new data blocks and return them to the gatherer module 1220. The fingerprint manager 1230 creates and stores fingerprint entries for the new data blocks in a changelog file (e.g., changelog 1270) that is coupled to the de-duplication module 1200.

Typically, a conventional changelog contains information of the same type as a FPDS (i.e., fp, inode, FBN), but only for new or modified blocks. A new or modified data block includes data blocks that are written or updated in storage (e.g., storage 170A,B in FIG. 1A) since a last de-duplication operation. According to certain embodiments, the fingerprint manager 1230 creates and stores fingerprint entries in a changelog 1270 that include, and are not limited to, the inode number of the file (e.g., ‘inode’) to which the block belongs, the FBN (file block number) of the block (e.g., ‘fbn’), the VVBN (virtual volume block number) of the block (e.g., ‘vvbn’), the fingerprint of the block (e.g., ‘fp’), and a generation time stamp (e.g., ‘cp-cnt’).

The fingerprint manager 1230 detects a de-duplication start request (e.g., sis start command), sorts the entries in the FPDS 1260 and in the changelog 1270 by fingerprint value, and merges the entries in the changelog 1270 with the entries in the FPDS 1260 to identify potentially duplicate data blocks.
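The sort-and-merge step can be sketched as follows. Entries are modeled as hypothetical `(fp, vvbn)` tuples and the function name `potential_duplicates` is invented for illustration; the sketch only shows how sorting both inputs by fingerprint makes matching entries adjacent in the merged stream.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def potential_duplicates(fpds, changelog):
    """Yield (fp, [vvbn, ...]) for every fingerprint seen more than once.

    Both inputs are sorted by fingerprint and then merged, so entries with
    equal fingerprints end up next to each other.
    """
    by_fp = itemgetter(0)
    merged = heapq.merge(
        sorted(fpds, key=by_fp), sorted(changelog, key=by_fp), key=by_fp
    )
    for fp, group in groupby(merged, key=by_fp):
        entries = list(group)
        if len(entries) > 1:
            yield fp, [vvbn for _, vvbn in entries]


fpds = [(0xA1, 10), (0xB2, 11), (0xC3, 12)]
changelog = [(0xC3, 20), (0xD4, 21), (0xA1, 22)]
candidates = dict(potential_duplicates(fpds, changelog))
```

The blocks in each group are only candidates at this point; matching fingerprints do not prove that the underlying blocks are identical.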

Unlike conventional de-duplication solutions, which load a potential duplicate data block using an inode and FBN (e.g., <inode, fbn>), the block sharing engine 1250 loads a potential duplicate data block using a VVBN, according to certain embodiments. The block sharing engine 1250 performs a byte-by-byte analysis of the loaded potentially duplicate data blocks to identify actual duplicate data blocks. The blocks represented by any entries which have identical fingerprints are considered to be potential duplicate blocks, rather than actual duplicates, since there is always a possibility that two non-identical blocks could have the same fingerprint, regardless of the fingerprint scheme being used. The block sharing engine 1250 can eliminate the actual duplicate data blocks and implement block sharing by calling functions of a file system (e.g., file system 310 in FIG. 3). Eliminating the duplicate data blocks includes sharing the remaining instance of each data block that was duplicated and freeing the (no longer used) duplicate data block(s). One embodiment of a method for eliminating a data block, such as a duplicate block, is described in greater detail in conjunction with FIG. 23.
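The byte-by-byte confirmation can be illustrated with a small sketch. The function `actual_duplicates` and the `read_block` callable are hypothetical; `read_block` stands in for whatever mechanism loads a block's contents by VVBN.

```python
def actual_duplicates(candidate_vvbns, read_block):
    """Split fingerprint-matched candidates into groups of identical blocks.

    read_block is a hypothetical callable that returns a block's bytes for a
    given VVBN. The result maps each VVBN that is kept to the list of VVBNs
    whose bytes are identical to it and could therefore be freed.
    """
    kept = {}  # kept VVBN -> VVBNs that duplicate it
    for vvbn in candidate_vvbns:
        data = read_block(vvbn)
        for keeper in kept:
            if read_block(keeper) == data:  # full byte comparison
                kept[keeper].append(vvbn)
                break
        else:
            kept[vvbn] = []  # content not seen before; keep this block
    return kept


# Three blocks with (by assumption) the same fingerprint; only two match.
blocks = {10: b"alpha" * 4, 22: b"alpha" * 4, 31: b"omega" * 4}
result = actual_duplicates([10, 22, 31], blocks.__getitem__)
```

Here block 31 stands for a fingerprint collision: it shares a fingerprint with the others but differs in content, so it is kept rather than freed.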

In one embodiment, the fingerprint manager 1230 overwrites the current FPDS 1260 with the merged entries to create a new FPDS (e.g., fingerprint.next). The new FPDS can be used for a verify operation to identify and remove stale fingerprint entries that correspond to eliminated data blocks and/or for a subsequent de-duplication operation.

The stale fingerprint manager 1280 performs a verify operation to identify and remove ‘stale’ fingerprint entries from a FPDS 1260, according to one embodiment. In another embodiment, the stale fingerprint manager 1280 removes stale fingerprints during a subsequent de-duplication operation. A stale entry (e.g., stale fingerprint record) is an entry that has a fingerprint that corresponds to a data block that has been eliminated (deleted or overwritten) by the block sharing engine 1250. The fingerprint manager 1230 saves context information in the fingerprint entries for each block, such as the value of a consistency point counter (e.g., ‘cp-cnt’) at the time the block was written to a storage device (e.g., disk).

The stale fingerprint manager 1280 sorts the FPDS 1260 (e.g., fingerprint.next) by VVBN and uses the context information (e.g., ‘cp-cnt’) to identify stale fingerprint entries in the FPDS 1260. Sorting by VVBN ensures that only the latest copy of each VVBN (the one with the highest cp-cnt) is retained and all others are removed from the FPDS 1260. Entries having higher consistency point counter values are more recent than entries with lower consistency point counter values. The stale fingerprint manager 1280 identifies fingerprint entries having the same VVBN as other entries, but with lower consistency point counter values compared to the other entries, as stale fingerprint entries. The unidentified entries are stale-free fingerprint entries.
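A minimal sketch of this rule follows, assuming entries carry only a fingerprint, a VVBN, and a consistency point counter. The names `Entry` and `split_stale` are hypothetical.

```python
from itertools import groupby
from operator import attrgetter
from typing import NamedTuple


class Entry(NamedTuple):
    fp: int
    vvbn: int
    cp_cnt: int


def split_stale(entries):
    """Return (fresh, stale): per VVBN, only the highest cp_cnt is fresh."""
    # Sort by <vvbn, cp-cnt> so the newest entry is last within each VVBN.
    ordered = sorted(entries, key=lambda e: (e.vvbn, e.cp_cnt))
    fresh, stale = [], []
    for _, group in groupby(ordered, key=attrgetter("vvbn")):
        *older, newest = group
        stale.extend(older)
        fresh.append(newest)
    return fresh, stale


entries = [Entry(0xA, 7, 3), Entry(0xB, 7, 9), Entry(0xC, 8, 4)]
fresh, stale = split_stale(entries)
```

In this example VVBN 7 was written at consistency points 3 and 9, so the entry recorded at consistency point 3 is classified as stale.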

The stale fingerprint manager 1280 checks whether the VVBN for each identified entry is valid. The stale fingerprint manager 1280 can examine an active map to ensure a VVBN is valid (ensure that a VVBN has not changed). When the stale fingerprint manager 1280 determines that a VVBN is not valid, it deletes the stale fingerprint entries from the FPDS 1260 (e.g., fingerprint.next). The stale fingerprint manager 1280 also determines whether an entry (e.g., record) is an entry having logical or physical data. An entry (e.g., fingerprint record) can include data indicating the type (e.g., physical, logical) of entry. The stale fingerprint manager 1280 also checks a ‘refcount’ for the VVBN to ensure that the VVBN is shared. One embodiment of a method of identifying and removing stale fingerprint entries using VVBNs is described in greater detail in conjunction with FIG. 18. Complementary to this functionality, information on the deleted files and/or blocks in the deletion code path is also logged and used to clean up stale entries.

FIG. 14A is a block diagram 1400 for addressing a data block in a volume using a virtual volume block number (VVBN), according to certain embodiments. A storage server is coupled to storage (e.g., storage 170A,B in FIG. 1A) storing blocks of data, and generates a fingerprint for each data block in the storage. The storage server creates a fingerprints datastore (FPDS) that stores an entry (e.g., fingerprint record) for each data block that has been written to storage. In one embodiment, the FPDS stores an entry for each unique fingerprint. One embodiment of a FPDS that is divided into multiple parts, such as a primary datastore and a secondary datastore, is described in greater detail in conjunction with FIGS. 9A-9B. Another embodiment of a FPDS that is organized into segments is described in greater detail in conjunction with FIGS. 11A-11B.

A fingerprints datastore 1471A stores an entry (e.g., fingerprint record) for each data block that is written to storage (e.g., storage 170A,B in FIG. 1A). According to certain embodiments, an entry can include, and is not limited to, a container file identifier (e.g., ‘Container-FileID’), the fingerprint of the block (e.g., ‘fp’), the VVBN (virtual volume block number) of the block (e.g., ‘vvbn’), and a generation time stamp (e.g., ‘cp-cnt’). The entries in the FPDS 1471A are sorted by fingerprint value.

A changelog 1473A stores fingerprint entries for new or modified blocks. A new or modified data block includes data blocks that are written or updated in storage since a last de-duplication operation. According to certain embodiments, an entry (e.g., fingerprint record) in a changelog 1473 can include, and is not limited to, the inode number of the file (e.g., ‘inode’) to which the block belongs, the FBN (file block number) of the block (e.g., ‘fbn’), the VVBN (virtual volume block number) of the block (e.g., ‘vvbn’), the fingerprint of the block (e.g., ‘fp’), and a generation time stamp (e.g., ‘cp-cnt’). The entries in the changelog 1473A are sorted by fingerprint value.

The entries in the sorted changelog 1473B are merged with the entries in the sorted FPDS 1471A. Entries that have identical fingerprint values are identified as entries that correspond to potential duplicate data blocks. The potential duplicate data blocks are loaded. Unlike conventional de-duplication solutions, which load a potential duplicate data block using an inode and FBN (e.g., <inode, fbn>), embodiments load a potential duplicate data block using a VVBN. The potential duplicate data blocks are further analyzed to identify actual duplicate data blocks and the actual duplicate blocks are eliminated. The current FPDS (e.g., FPDS 1471A) is overwritten with the merged entries to create a new FPDS. The new FPDS can be used for a verify operation to identify and remove stale fingerprint entries that correspond to eliminated data blocks and/or for a subsequent de-duplication operation.

FIG. 14B is a flow diagram of a method 1450 for addressing a data block in a volume using a virtual volume block number (VVBN), according to certain embodiments. The flow diagram corresponds to block diagram 1400 in FIG. 14A. Method 1450 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device), or a combination thereof. In one embodiment, method 1450 is performed by a de-duplication module (e.g., de-duplication module 390 in FIG. 3) hosted by storage servers 110 of FIG. 1A.

At instruction block 1401, the method maintains a sorted fingerprints datastore 1471A. The fingerprints datastore 1471A stores an entry (e.g., fingerprint record) for each data block that is written to storage (e.g., storage 170A,B in FIG. 1A). According to certain embodiments, an exemplary entry can include, and is not limited to, <Container-FileID, fp, cp-cnt, VVBN>. The entries can be sorted by fingerprint value for identifying potential duplicate data blocks. At instruction block 1403, the method detects a de-duplication operation start request (e.g., sis start command) and sorts the entries in a changelog 1473A, for example, by fingerprint, at instruction block 1405. At instruction block 1407, the method merges the entries in the sorted changelog 1473B with the entries in the FPDS 1471A.

At instruction block 1409, during the merging process, the method identifies potential duplicate data blocks by determining which, if any, entries in the changelog 1473A have the same fingerprint value as the entries in the FPDS 1471A. The fingerprint entries having the same fingerprint values pertain to potentially duplicate data blocks and these entries can be written to an output datastore (e.g., output file). If there are fingerprint entries that have the same fingerprint value (instruction block 1409), the method loads the potential duplicate data block using VVBN at instruction block 1411 and performs a byte-by-byte analysis to determine whether any of the loaded data blocks are actual duplicate data blocks at instruction block 1413. If there are any actual duplicate data blocks (instruction block 1415), the actual duplicate data blocks are eliminated at instruction block 1417. Eliminating the duplicate data blocks includes sharing the remaining instance of each data block that was duplicated and freeing the (no longer used) duplicate data block(s). One embodiment of a method for eliminating a data block, such as a duplicate block, is described in greater detail in conjunction with FIG. 23. At instruction block 1419, the method overwrites the existing FPDS 1471A with the merged data to create a new FPDS 1471B. The new FPDS can be used for a verify operation to identify and remove stale fingerprint entries that correspond to eliminated data blocks and/or for a subsequent de-duplication operation.

FIG. 15 illustrates the elements of a stale fingerprint manager 1500 (e.g., stale fingerprint manager 680 in FIG. 6) for identifying and removing stale fingerprint entries when a next de-duplication operation is invoked, according to certain embodiments. The elements include a verify trigger detector 1510, a stale entry identifier 1520, a stale entry manager 1530, and a data sorter 1540.

During a first phase of a de-duplication process (a de-dupe operation), duplicate data blocks are identified and eliminated. The fingerprint entries that correspond to the eliminated duplicate data blocks and remain in a FPDS are referred to as ‘stale’ fingerprint entries. During a second phase of a de-duplication process (a verify operation or verify scan), stale fingerprint entries are identified and removed from a FPDS.

Conventional implementations of a verify operation include two stages. In stage one, a trigger to invoke a verify operation is detected and the entries in the FPDS are first sorted (Sort #1) by <file identifier, block offset in a file, time stamp> (e.g., <inode, fbn, cp-cnt>) order. The verify operation checks whether any of the fingerprint entries are stale, and overwrites the existing FPDS with only the stale-free entries to create a new stale-free FPDS. In stage two, the output from stage one is sorted (Sort #2) a second time back to its original order, such as fingerprint value, inode, file block number (e.g., <fp, inode, fbn>). One problem with this conventional approach is that it sorts the FPDS twice with each verify operation. The second sort (Sort #2) during verify stage two is unnecessary to remove the stale entries. Another problem with the conventional approach is that it overwrites the entire FPDS with stale-free entries, even if the number of stale entries is a small percentage of the FPDS.

One aspect of a verify operation optimizes current stale entries removal by reducing the time to sort the fingerprints datastore. Stale entry information is recorded to a separate datastore (e.g., stale entries file), which would be proportional in size to the stale entries in the fingerprints datastore, rather than rewriting the entire fingerprints datastore. During a subsequent de-duplication operation, the entries in the stale entries datastore are merged with the entries in the fingerprints datastore and the stale entries are removed during a de-duplication operation when there is a full read/write of the entire fingerprints datastore. Thus, the second sort of the fingerprints datastore in a conventional solution is eliminated.

The stale fingerprint manager 1500 is coupled to a fingerprints datastore (FPDS) 1550 that stores an entry (e.g., fingerprint record) for each data block that has been written to storage (e.g., storage 170A,B in FIG. 1A). One embodiment of a FPDS that is divided into multiple parts, such as a primary datastore and a secondary datastore, is described in greater detail in conjunction with FIGS. 9A-9B. Another embodiment of a FPDS that is organized into segments is described in greater detail in conjunction with FIGS. 11A-11B.

In one embodiment, an entry can include, and is not limited to, the fingerprint of the block (‘fp’), context information, such as the value of a consistency point counter (e.g., a generation time stamp (‘cp-cnt’)) at the time the block was written to a storage device, and logical data, such as the inode number of the file to which the block belongs (‘inode’) and the FBN (file block number) of the block (‘fbn’). In another embodiment, an entry (e.g., fingerprint record) can include, and is not limited to, the fingerprint of the block (‘fp’), context information, such as the value of a consistency point counter (e.g., a generation time stamp (‘cp-cnt’)) at the time the block was written to a storage device, and physical data, such as a container file identifier (‘Container-FileID’) and the VVBN (virtual volume block number) of the block (‘vvbn’).

A verify operation (stale entries removal operation) can be automatically triggered when the number of stale entries in a FPDS reaches or exceeds a stale entries threshold, for example, when a number of stale fingerprint entries in a FPDS is beyond 20%. The verify trigger detector 1510 determines a current number of stale entries in a FPDS 1550 and compares the current number to a stale entries threshold. The stale entries threshold can be a user-defined threshold stored as a parameter in a data store 1570 that is coupled to the stale fingerprint manager 1500. In another example, a verify operation is triggered from a CLI. In another example, a verify operation is user-driven, for example, by the method detecting instructions entered by a user via a command line interface.
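The automatic trigger can be sketched as a simple comparison. The function name `should_trigger_verify` and the percentage-based parameter are assumptions for illustration; the text states only that the stale entry count is compared to a user-defined threshold, with 20% given as an example.

```python
def should_trigger_verify(stale_entries, total_entries, threshold_percent=20):
    """Return True when stale entries reach or exceed the threshold.

    Integer arithmetic is used so that the boundary case (exactly at the
    threshold) is not affected by floating-point rounding.
    """
    if total_entries == 0:
        return False
    return stale_entries * 100 >= threshold_percent * total_entries
```

For example, `should_trigger_verify(250, 1000)` is true, while `should_trigger_verify(50, 1000)` is false.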

When the verify trigger detector 1510 detects a trigger to execute a verify operation, the data sorter 1540 sorts the entries in the FPDS 1550. In one embodiment, the data sorter 1540 sorts the entries in the FPDS 1550 using logical data and the context information (e.g., by <file identifier, block offset in a file, time stamp>, such as <inode, fbn, cp-cnt> order). In another embodiment, the data sorter 1540 sorts the entries in the FPDS 1550 using physical data and the context information (e.g., by <vvbn, cp-cnt> order).

The stale entry identifier 1520 determines whether a stale entries datastore (e.g., stale entries file) 1560 exists. When a stale entries datastore 1560 does not yet exist, the stale entry identifier 1520 uses context information that is stored in the FPDS 1550 to identify stale fingerprint entries. Fingerprint entries having higher consistency point counter values are more recent than entries with lower consistency point counter values. In one embodiment, the stale entry identifier 1520 identifies fingerprint entries having the same <inode, fbn> as other entries, but with lower consistency point counter values compared to the other entries, as stale fingerprint entries. In another embodiment, the stale entry identifier 1520 identifies fingerprint entries having the same <vvbn> as other entries, but with lower consistency point counter values compared to the other entries, as stale fingerprint entries. The unidentified entries are stale-free fingerprint entries.

When a stale entries datastore 1560 exists, the stale entry identifier 1520 compares the entries in the FPDS 1550 with the entries in the stale entries datastore 1560 to identify stale fingerprint entries. The de-duplication module identifies fingerprint entries having the same fingerprint values as the fingerprint entries in the stale entries datastore 1560 as stale fingerprint entries. The unidentified entries are stale-free fingerprint entries.

The stale entry manager 1530 creates and updates the stale entries datastore 1560. In one embodiment, the stale entry manager 1530 creates copies of the identified stale entries in the stale entries datastore 1560. In another embodiment, the stale entry manager 1530 does not create copies of the stale entries in the stale entries datastore 1560, but writes stale entry information for each of the identified stale fingerprint entries to the stale entries datastore 1560. The stale entry information for each entry can include, and is not limited to, an entry index (e.g., record index), inode, fbn, inode generation count, and a block generation time stamp. By storing an entry index and entry information, rather than an entry itself, the size of the stale entries datastore 1560 can be reduced. With this optimization, the stale entries datastore 1560 contains entry indices of the stale fingerprint entries, and the stale fingerprint manager 1500 can use this datastore 1560 of indices to remove the stale fingerprint entries from the FPDS 1550.

FIG. 16A is a block diagram 1600 for removing stale fingerprint entries from a fingerprints datastore (FPDS) 1671A when a next de-duplication operation is invoked, according to certain embodiments. One embodiment of a FPDS that is divided into multiple parts, such as a primary datastore and a secondary datastore, is described in greater detail in conjunction with FIGS. 9A-9B. Another embodiment of a FPDS that is organized into segments is described in greater detail in conjunction with FIGS. 11A-11B.

In one embodiment, there is a FPDS 1671A that stores an entry (e.g., fingerprint record) for each data block that is written to the storage (e.g., storage 170A,B in FIG. 1A). In another embodiment, there is a master FPDS (e.g., segment.0 1171A-0 in FIG. 11A) and segments (e.g., segments 1171A-1 to 1171A-n in FIG. 11A). In one embodiment, an entry can include, and is not limited to, the fingerprint of the block (‘fp’), context information, such as the value of a consistency point counter (e.g., a generation time stamp (‘cp-cnt’)) at the time the block was written to a storage device, and logical data, such as the inode number of the file to which the block belongs (‘inode’) and the FBN (file block number) of the block (‘fbn’). In another embodiment, an entry (e.g., fingerprint record) can include, and is not limited to, the fingerprint of the block (‘fp’), context information, such as the value of a consistency point counter (e.g., a generation time stamp (‘cp-cnt’)) at the time the block was written to a storage device, and physical data, such as a container file identifier (‘Container-FileID’) and the VVBN (virtual volume block number) of the block (‘vvbn’).

For illustration purposes, one embodiment of a verify operation for identifying and removing stale fingerprint entries is described as three stages 1635, 1637, and 1639. During verify stage one 1635, a trigger for a verify operation is detected and the entries in the FPDS 1671A are sorted. In one embodiment, the entries in the FPDS 1671A are sorted by logical data and the context information (e.g., by <file identifier, block offset in a file, time stamp>, such as <inode, fbn, cp-cnt> order). In another embodiment, the entries in the FPDS 1671A are sorted by physical data and the context information (e.g., by <vvbn, cp-cnt> order). The stale fingerprint entries are identified from the context information, and stale entry information for the stale entries is written to a stale entries datastore 1675A.

During verify stage two 1637, the stale entries in the stale entries datastore are sorted (e.g., by entry index). Verify stage three 1639 occurs during a subsequent de-duplication operation 1647. A next de-duplication start request is detected, and during verify stage three 1639, the entries in the sorted stale entries datastore 1675B are merged in-memory with the FPDS 1671A, according to one embodiment. In another embodiment, during verify stage three 1639, the entries in the sorted stale entries datastore 1675B are merged in-memory with the master FPDS (e.g., segment.0 1171A-0 in FIG. 11A) and segments (e.g., segments 1171A-1 to 1171A-n in FIG. 11A). While the data is being merged in-memory, the entries are compared to identify any entries in the FPDS 1671A (or the master FPDS and the segments) that correspond to an entry in the stale entries datastore 1675B; these are the stale entries. The identified stale entries are removed and the FPDS 1671A (or master FPDS) is overwritten with the stale-free entries to create a stale-free FPDS 1671B. This stale-free FPDS 1671B can be used in performing the de-duplication operation to identify and eliminate duplicate blocks.
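Stages two and three can be sketched as a single pass over the FPDS that skips the recorded positions. This is an illustration under stated assumptions: the stale entries datastore is modeled as a list of entry indices, those indices refer to positions in the FPDS as it stands at merge time, and the function name `remove_stale_by_index` is hypothetical.

```python
def remove_stale_by_index(fpds, stale_indices):
    """Return the FPDS entries whose positions are not in stale_indices.

    The indices are sorted first (stage two) and then consumed in one pass
    over the FPDS (stage three), so the FPDS itself is never re-sorted.
    """
    pending = iter(sorted(set(stale_indices)))
    next_stale = next(pending, None)
    kept = []
    for index, entry in enumerate(fpds):
        if index == next_stale:
            next_stale = next(pending, None)  # drop this entry
        else:
            kept.append(entry)
    return kept


fpds = ["fp-a", "fp-b", "fp-c", "fp-d", "fp-e"]
stale_free = remove_stale_by_index(fpds, [3, 0])
```

The sketch only works if the FPDS is not modified between recording the indices and applying them, since any insertion or removal would shift the positions.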

FIG. 16B is a flow diagram of a method 1650 for removing stale fingerprint entries when a next de-duplication operation is invoked, according to certain embodiments. The flow diagram corresponds to block diagram 1600 in FIG. 16A. Method 1650 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device), or a combination thereof. In one embodiment, method 1650 is performed by a de-duplication module (e.g., de-duplication module 390 in FIG. 3) hosted by storage servers 110 of FIG. 1A.

The de-duplication module is coupled to storage (e.g., storage 170A,B in FIG. 1A) storing blocks of data, and generates a fingerprint for each data block in the storage. The de-duplication module is coupled to a fingerprints datastore (FPDS) that stores an entry (e.g., fingerprint record) for each data block that has been written to storage. In one embodiment, the FPDS stores an entry (e.g., fingerprint record) for each unique fingerprint. One embodiment of a FPDS that is divided into multiple parts, such as a primary datastore and a secondary datastore, is described in greater detail in conjunction with FIGS. 9A-9B. In another embodiment, there is a master FPDS (e.g., segment.0 1171A-0 in FIG. 11A) and segments (e.g., segments 1171A-1 to 1171A-n in FIG. 11A).

At instruction block 1601, the method detects a trigger to invoke a verify operation. A verify operation can be automatically triggered when the number of stale entries in a FPDS reaches or exceeds a stale entries threshold, for example, when more than 20% of the fingerprint entries are stale. In another example, a verify operation is triggered from a CLI. In another example, a verify operation is user-driven, for example, by the method detecting instructions entered by a user via a command line interface.

At instruction block 1603, the method sorts the entries in the FPDS. In one embodiment, the method sorts the entries in the FPDS using logical data and the context information (e.g., by <file identifier, block offset in a file, time stamp>, such as <inode, fbn, cp-cnt> order). In another embodiment, the method sorts the entries in the FPDS using physical data and the context information (e.g., by <vvbn, cp-cnt> order).

At instruction block 1605, the method identifies the stale fingerprintentries from the context information for each entry. Fingerprint entriesin the FPDS with higher consistency point counter values (e.g., cp-cnt)are more recent than entries with lower consistency point countervalues. In one embodiment, the method identifies entries that correspondto freed data blocks using <inode,fbn> and identifies entries that havethe same <inode, fbn>, as other entries, but have a lower consistencypoint counter value compared to the other entries, as stale entries. Inanother embodiment, the method identifies entries that correspond tofreed data blocks using <vvbn> and identifies entries that have the same<vvbn> as other entries, but have a lower consistency point countervalue compared to the other entries, as stale entries. The unidentifiedentries are stale free fingerprint entries. Complementary to thisfunctionality, information on the deleted files and/or blocks in thedeletion code path is also logged and used to clean up stale entries.
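The identification step at instruction block 1605 can be illustrated with a minimal sketch. It assumes the logical-data embodiment, with entries sorted in <inode, fbn, cp-cnt> order; the names Entry and split_stale are hypothetical and do not appear in the described embodiments.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """Hypothetical fingerprint record (field names are assumptions)."""
    fp: str
    inode: int
    fbn: int
    cp_cnt: int


def split_stale(entries: List[Entry]) -> Tuple[List[Entry], List[Entry]]:
    """Return (stale_free, stale) after sorting by <inode, fbn, cp-cnt>.

    An entry is treated as stale when the next entry in sort order refers
    to the same <inode, fbn>, i.e. a more recent record for the same block
    exists.
    """
    ordered = sorted(entries, key=lambda e: (e.inode, e.fbn, e.cp_cnt))
    stale_free, stale = [], []
    for i, entry in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if nxt is not None and (nxt.inode, nxt.fbn) == (entry.inode, entry.fbn):
            stale.append(entry)       # superseded by a later consistency point
        else:
            stale_free.append(entry)  # latest record for this block
    return stale_free, stale
```

The same loop applies to the physical-data embodiment if the sort key and the comparison use <vvbn> in place of <inode, fbn>.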

At instruction block 1607, in one embodiment, the method copies the stale entries into a stale entries datastore (e.g., fingerprint.stale). Each entry in the stale entries datastore can include a segment identifier. In another embodiment, the size of the stale entries datastore is minimized by storing information for the stale entries rather than copying the stale entries themselves. Stale entry information for a stale entry can include, and is not limited to, an entry number (e.g., entry index) for the stale entry, inode, file block number, generation time stamp, etc. By storing entry information instead of a copy of the stale entry, an entry size in the stale entries datastore can be reduced from 32 bytes to 24 bytes. In this embodiment, the stale entries datastore only contains the entry indexes of the stale fingerprint entries and corresponding entry information (e.g., segment identifier), rather than the copies of the entries themselves. The stale entries datastore can contain only the entry indexes based on the assumption that the FPDS will not be changed before removing the stale entries using the stale entries datastore. If the FPDS is changed before using the stale entries datastore to remove stale entries from the FPDS, the FPDS indexing scheme will not be valid.

During verify stage two, at instruction block 1609, the method sorts theentries in the stale entries datastore. The entries can be sorted byentry index. At instruction block 1611, the method detects ade-duplication start request (e.g., sis start command). In response todetecting a de-dupe start request, verify stage three begins, and themethod determines whether a stale entries datastore exists atinstruction block 1613. If there is not a stale entries datastore, themethod continues to instruction block 1621. For example, a verifyoperation may not have been previously executed to create a staleentries datastore.

If there is a stale entries datastore, the method performs an in-memory merge of the entries in the sorted stale entries datastore with the FPDS at instruction block 1614, according to one embodiment. In another embodiment, the method performs an in-memory merge of the entries in the sorted stale entries datastore with a master FPDS and all of the segments at instruction block 1614.

At instruction block 1615, while the data is being merged in-memory, the method compares the entries to identify any entries in the FPDS that correspond to an entry in the stale entries datastore to identify the stale fingerprints to be removed from the FPDS, according to one embodiment. In another embodiment where a FPDS is organized as a segmented FPDS, the method compares the entries in the master FPDS and the entries in all of the FPDS segments with the entries in the stale entries datastore to identify the stale fingerprints to be removed from the FPDS at instruction block 1615. In one embodiment, while merging the changelog entries with the FPDS entries, each entry from the FPDS is cross checked against the sorted stale entries datastore. If an entry in the FPDS corresponds to (e.g., matches) an entry in the stale entries datastore, the method identifies the entry as a stale entry. In another embodiment, the stale entries datastore stores an entry index of the stale entries. The FPDS should remain unchanged in order for the indexing scheme in the stale entries datastore to remain valid. Prior to the changelog entries merging with the FPDS entries, the entry index information in the stale entries datastore is compared to the entries in the FPDS. If there is a match between an entry in the FPDS and an entry index in the stale entries datastore, the method identifies the entry as a stale entry.

At instruction block 1617, the method removes the stale entries from the FPDS. The method can purge the stale entries. At instruction block 1619, the method overwrites the existing FPDS (e.g., FPDS or master FPDS) with the stale-free entries to create a stale-free FPDS (e.g., stale-free master FPDS). At instruction block 1621, the method continues with the de-duplication operation to identify and eliminate duplicate data blocks using the stale-free FPDS.
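As an illustration of instruction blocks 1614 through 1619, the following sketch assumes the embodiment in which the stale entries datastore holds only entry indexes into an unchanged FPDS. The function name remove_stale and the list-based representation are assumptions made for the example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def remove_stale(fpds: Sequence[T], stale_indexes: Sequence[int]) -> List[T]:
    """Walk the FPDS and the sorted stale indexes together (one pass).

    Entries whose position matches a stale index are dropped; the result
    is the stale-free list that would overwrite the existing FPDS.
    """
    stale_sorted = sorted(stale_indexes)  # verify stage two: sort by entry index
    kept: List[T] = []
    j = 0
    for i, entry in enumerate(fpds):
        # Skip stale indexes that lie before the current FPDS position.
        while j < len(stale_sorted) and stale_sorted[j] < i:
            j += 1
        if j < len(stale_sorted) and stale_sorted[j] == i:
            continue  # stale entry: purge
        kept.append(entry)
    return kept
```

Because the indexes refer to positions, the sketch is only meaningful if the FPDS has not changed since the indexes were recorded, which matches the assumption stated for this embodiment.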

FIG. 17A is a block diagram 1700 of a verify operation to identify andremove stale fingerprint entries using a primary FPDS and a secondaryFPDS, according to certain embodiments. One embodiment maintains asorted primary FPDS 1771A and a sorted secondary FPDS 1775A. A primaryFPDS 1771A contains an entry (e.g., fingerprint record) for each uniquefingerprint value. A secondary FPDS 1775A contains fingerprints entriesthat have the same fingerprint value as an entry in the primary FPDS1771A. In one embodiment, an entry can include, and is not limited to,the fingerprint of the block, the inode number of the file to which theblock belongs, and the FBN (file block number) of the block. In certainembodiments, each fingerprint is a checksum, such as an MD5 checksum.

A de-duplication operation 1751 is triggered by a de-duplication operation start request (e.g., sis start command) and sorts a changelog 1773A by fingerprint. The de-duplication operation performs an in-memory merge of the sorted changelog 1773B with the sorted primary FPDS 1771A to identify and eliminate duplicate data blocks. The de-duplication operation writes the fingerprint entries that correspond to the eliminated duplicate data blocks to a third datastore (e.g., datastore 1777) and overwrites the primary FPDS 1771A with the fingerprint entries that correspond to the unique data blocks to create an updated primary FPDS 1771B.
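The merge performed by the de-duplication operation can be sketched as follows. This is a simplified illustration, not the described implementation: entries are assumed to be (fingerprint, inode, fbn) tuples, the function name dedupe_merge is hypothetical, and freeing the duplicate blocks themselves is outside the sketch.

```python
import heapq
from typing import List, Tuple

Record = Tuple[str, int, int]  # (fingerprint, inode, fbn), an assumed layout


def dedupe_merge(primary: List[Record],
                 changelog: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Merge a fingerprint-sorted primary FPDS with a changelog.

    Returns (updated_primary, third_datastore): the first record seen for
    each fingerprint stays in the primary FPDS, later records with the
    same fingerprint are treated as duplicates.
    """
    by_fp = lambda rec: rec[0]
    sorted_changelog = sorted(changelog, key=by_fp)
    updated_primary: List[Record] = []
    third_datastore: List[Record] = []
    last_fp = None
    # heapq.merge keeps ties in input order, so primary records come first.
    for rec in heapq.merge(primary, sorted_changelog, key=by_fp):
        if rec[0] == last_fp:
            third_datastore.append(rec)  # duplicate of an existing block
        else:
            updated_primary.append(rec)  # unique fingerprint
            last_fp = rec[0]
    return updated_primary, third_datastore
```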

A verify operation 1753 includes an in-memory merge of the entries inthe third datastore 1777 with the entries in the secondary FPDS 1775A,and then with the entries in the primary FPDS 1771B, according to oneembodiment. In another embodiment, the third datastore 1777 is asecondary datastore (e.g., tempfile.x) for each de-dupe operationbetween verify operations. In one embodiment, the secondary datastore(e.g., tempfile.x) is a segmented datastore, as described in conjunctionwith FIG. 11A. The secondary datastore can be sorted in <inode, fbn>order. Later during the verify operation all tempfile.x files can bemerged on-disk and sorted in <inode, fbn> order to optimize a verifyoperation.

The verify operation identifies and removes stale entries from the merged data and writes the remaining stale-free entries to a stale-free datastore. The verify operation sorts the stale-free data by fingerprint and identifies the entries that correspond to duplicate data blocks. The verify operation writes the identified entries to a second datastore to create an updated secondary FPDS 1775B and overwrites the existing primary FPDS 1771B with the fingerprint entries for the unique data blocks to create an updated primary FPDS 1771C.

FIG. 17B is a flow diagram of a method 1750 of a verify operation toidentify and remove stale fingerprint entries using a primary FPDS and asecondary FPDS, according to certain embodiments. The flow diagramcorresponds to block diagram 1700 in FIG. 17A. Method 1750 can beperformed by processing logic that can comprise hardware (e.g.,circuitry, dedicated logic, programmable logic, microcode, etc.),software (e.g., instructions run on a processing device), or acombination thereof. In one embodiment, method 1750 is performed by ade-duplication module (e.g., de-duplication module 390 in FIG. 3) hostedby storage servers 110 of FIG. 1A.

At instruction block 1701, the method detects a trigger to invoke a verify operation. A verify operation can be automatically triggered when the number of stale entries in a FPDS reaches or exceeds a stale entries threshold, for example, when more than 20% of the fingerprint entries are stale. In another example, a verify operation is triggered from a CLI. In another example, a verify operation is user-driven, for example, by the method detecting instructions entered by a user via a command line interface.

During a previously executed de-duplication operation where the methodidentifies and eliminates duplicate data blocks, the fingerprint entriesthat correspond to the eliminated duplicate data blocks were written toa third datastore. A secondary FPDS contains fingerprints entries thathave the same fingerprint value as an entry (e.g., record) in theprimary FPDS. At instruction block 1703, the method performs anin-memory merge of a third datastore with a secondary FPDS to create anupdated secondary FPDS, according to one embodiment. In anotherembodiment, the third datastore is a secondary datastore (e.g.,tempfile.x) for each de-dupe operation between verify operations. In oneembodiment, the secondary datastore (e.g., tempfile.x) is a segmenteddatastore, as described in conjunction with FIG. 11A. All tempfile.xfiles can be merged on-disk and sorted in <inode, fbn> order to optimizea verify operation.

During a previously executed de-duplication operation, an originalprimary FPDS is overwritten with fingerprint entries that correspond tothe unique data blocks to create an updated primary FPDS. At instructionblock 1705, the method performs an in-memory merge of the entries in theupdated primary FPDS with the entries from the in-memory merge of thesecondary FPDS and third datastore, according to one embodiment. Inanother embodiment, the entries in the updated primary FPDS are mergedin-memory with the entries of the on-disk merged tempfile.x files. Atinstruction block 1707, the method removes stale entries from the mergeddata. The method can identify stale entries using context information,such as the value of a consistency point counter at the time the blockwas written to a storage device.

The method writes the remaining stale-free entries to a stale-free datastore and sorts the stale-free data by fingerprint at instruction block 1709. At instruction block 1711, the method identifies the entries that correspond to duplicate data blocks and writes the identified entries to a second datastore to create an updated secondary FPDS. At instruction block 1713, the method overwrites the existing primary FPDS with the fingerprint entries for the unique data blocks to create an updated primary FPDS. One embodiment for dividing a FPDS into a primary FPDS and secondary FPDS is described in detail in conjunction with FIGS. 8A-8B.
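Instruction blocks 1703 through 1713 can be summarized in a short sketch. It assumes entries are (fingerprint, inode, fbn, cp-cnt) tuples held in plain lists, and that staleness is decided per <inode, fbn> by the highest consistency point counter; the function name verify_primary_secondary is hypothetical.

```python
from typing import Dict, List, Tuple

Rec = Tuple[str, int, int, int]  # (fingerprint, inode, fbn, cp_cnt), assumed


def verify_primary_secondary(primary: List[Rec], secondary: List[Rec],
                             third: List[Rec]) -> Tuple[List[Rec], List[Rec]]:
    """Return (updated_primary, updated_secondary)."""
    # Blocks 1703/1705: bring the three datastores together.
    merged = list(primary) + list(secondary) + list(third)

    # Block 1707: keep only the most recent record per <inode, fbn>.
    latest: Dict[Tuple[int, int], Rec] = {}
    for rec in merged:
        key = (rec[1], rec[2])
        if key not in latest or rec[3] > latest[key][3]:
            latest[key] = rec

    # Block 1709: sort the stale-free records by fingerprint.
    stale_free = sorted(latest.values(), key=lambda rec: rec[0])

    # Blocks 1711/1713: first record per fingerprint goes to the primary
    # FPDS, the rest (same fingerprint) go to the secondary FPDS.
    updated_primary, updated_secondary, seen = [], [], set()
    for rec in stale_free:
        if rec[0] in seen:
            updated_secondary.append(rec)
        else:
            seen.add(rec[0])
            updated_primary.append(rec)
    return updated_primary, updated_secondary
```

A dictionary is used here for brevity; the described embodiments perform the merges in-memory over sorted data or on-disk over tempfile.x files.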

FIG. 18 is a flow diagram of a method 1800 for executing a verifyoperation (stale fingerprint entry removal) using VVBNs (virtual volumeblock numbers), according to certain embodiments. Method 1800 can beperformed by processing logic that can comprise hardware (e.g.,circuitry, dedicated logic, programmable logic, microcode, etc.),software (e.g., instructions run on a processing device), or acombination thereof. In one embodiment, method 1800 is performed by ade-duplication module (e.g., de-duplication module 390 in FIG. 3) hostedby storage servers 110 of FIG. 1A.

A de-duplication module is coupled to a sorted FPDS. The FPDS stores anentry (e.g., fingerprint record) for each data block that is written tothe storage (e.g., storage 170A,B in FIG. 1A). According to certainembodiments, a FPDS is improved by referencing a data block in a volumeuniquely using a virtual volume block number (VVBN) instead of using<inode,fbn>. By using a VVBN to refer to a data block, a FPDS can moreeasily scale with increased block sharing.

Typically, an entry in a FPDS can include, and is not limited to, thefingerprint of the block (‘fp’), context information, such as, the valueof a consistency point counter (e.g., a generation time stamp(‘cp-cnt’)) at the time the block was written to a storage device, andlogical data, such as the inode number of the file to which the blockbelongs (‘inode’) and the FBN (file block number) of the block (‘fbn’).According to certain embodiments, an entry (e.g., fingerprint record)can also include physical data, such as a container file identifier(‘Container-FileID’) and the VVBN (virtual volume block number) of theblock (‘vvbn’). A FPDS is then reduced to a map file that can be indexedby VVBN, according to certain embodiments. Instead of or in addition tousing an inode and FBN to refer to a block in a volume, the method 1800can use VVBN to refer to a block.

At instruction block 1801, the method detects a trigger to invoke a verify operation. A verify operation can be automatically triggered when the number of stale entries in a fingerprints datastore (FPDS) reaches or exceeds a stale entries threshold, for example, when more than 20% of the fingerprint entries are stale. In another example, a verify operation is triggered from a CLI. In another example, a verify operation is user-driven, for example, by the method detecting instructions entered by a user via a command line interface.

At instruction block 1803, the method sorts the entries in the FPDS. The method sorts the entries in the FPDS by VVBN and the context information (e.g., by <vvbn, cp-cnt> order). Sorting by VVBN ensures that only the latest copy of the VVBN (e.g., one with highest cp-cnt) is retained in the FPDS.

At instruction block 1805, the method identifies the stale fingerprintentries from the context information for each entry. Fingerprint entriesin the FPDS with higher consistency point counter values (e.g., cp-cnt)are more recent than entries with lower consistency point countervalues. The method identifies entries that correspond to freed datablocks using <vvbn> and identifies entries that have the same <vvbn> asother entries, but have a lower consistency point counter (e.g., cp-cnt)value compared to the other entries, as stale entries. The unidentifiedentries are stale free fingerprint entries. Complementary to thisfunctionality, information on the deleted files and/or blocks in thedeletion code path is also logged and used to clean up stale entries.

At instruction block 1807, the method removes (purges) the stale fingerprint entries. At instruction block 1809, the method examines a stale-free entry to determine whether it has a valid VVBN. The method can examine an active map to ensure a VVBN is valid (ensure that a VVBN has not changed). If the method does not confirm that the VVBN is valid, it can delete (purge) the fingerprint entry at instruction block 1811. If the method confirms that the VVBN is valid, the method determines whether the fingerprint entry is a logical entry at instruction block 1813. A fingerprint entry can include data indicating the type (e.g., physical, logical) of entry. In one embodiment, at instruction block 1813, the method also checks a ‘refcount’ for the VVBN to ensure that the VVBN is shared.

If an entry is not a logical entry (instruction block 1813), the method writes it to the FPDS as is at instruction block 1817, that is, as a physical entry. If an entry is a logical entry (instruction block 1813), the method converts it to a physical entry at instruction block 1815 and writes the physical entry to the FPDS at instruction block 1817. At instruction block 1819, the method determines whether to validate another stale-free entry.
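The validation loop of instruction blocks 1809 through 1819 might look like the following sketch. The active map is modeled as a set of valid VVBNs and the logical-to-physical conversion as a simple change of a type field; both are simplifying assumptions, as are the names FpEntry and validate_entries.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Set


@dataclass(frozen=True)
class FpEntry:
    """Hypothetical fingerprint record keyed by VVBN."""
    fp: str
    vvbn: int
    cp_cnt: int
    kind: str  # "logical" or "physical"


def validate_entries(stale_free: Iterable[FpEntry],
                     active_vvbns: Set[int]) -> List[FpEntry]:
    """Return the entries that would be written back to the FPDS."""
    written: List[FpEntry] = []
    for entry in stale_free:
        if entry.vvbn not in active_vvbns:
            continue  # block 1811: VVBN not confirmed valid, purge
        if entry.kind == "logical":
            # block 1815: convert to a physical entry (modeled as a flag)
            entry = replace(entry, kind="physical")
        written.append(entry)  # block 1817: write to the FPDS
    return written
```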

FIG. 19 illustrates the elements of a de-duplication module 1900 (e.g., de-duplication module 390 in FIG. 3) for executing a verify operation (stale fingerprint record removal) as a background operation, according to certain embodiments. The elements include a stale fingerprint manager 1901 that includes a verify trigger detector 1903, a verify manager 1905, a data sorter 1907, a checkpoint creator 1913, a stale entry identifier 1909, and a stale entry manager 1911. The elements of the de-duplication module 1900 also include a de-duplication engine 1951, a gatherer 1953, a fingerprint manager 1955, a block sharing engine 1957, and a fingerprint handler 1959.

Typically, a verify operation (stale entries removal operation) is a blocking operation, that is, if a verify operation is executing on a FPDS, then no other de-duplication (sharing) operation can run because all de-duplication operations should work from a consistent copy of a FPDS. One aspect of de-duplication makes a verify operation a background job, so that if any de-duplication operation request is made while a verify operation is executing, the de-duplication request can be served. This helps decrease customer response time and helps avoid losing space savings that would result from being unable to run a de-duplication operation.

The stale fingerprint manager 1901 is coupled to a fingerprintsdatastore (FPDS) 1915 that stores an entry (e.g., fingerprint record)for each data block that has been written to storage (e.g., storage170A,B in FIG. 1A). One embodiment of a FPDS that is divided intomultiple parts, such as a primary datastore and a secondary datastore,is described in greater detail in conjunction with FIGS. 9A-9B. Anotherembodiment of a FPDS that is organized into segments is described ingreater detail in conjunction with FIGS. 11A-11B.

In one embodiment, an entry (e.g., fingerprint record) can include, andis not limited to, the fingerprint of the block (‘fp’), contextinformation, such as, the value of a consistency point counter (e.g., ageneration time stamp (‘cp-cnt’)) at the time the block was written to astorage device, and logical data, such as the inode number of the fileto which the block belongs (‘inode’) and the FBN (file block number) ofthe block (‘fbn’). In another embodiment, a fingerprint entry caninclude, and is not limited to, the fingerprint of the block (‘fp’),context information, such as, the value of a consistency point counter(e.g., a generation time stamp (‘cp-cnt’)) at the time the block waswritten to a storage device, and physical data, such as a container fileidentifier (Container-FileID) and the VVBN (virtual volume block number)of the block (‘vvbn’).

A verify operation (stale entries removal operation) can be automatically triggered when the number of stale entries in a FPDS 1915 reaches or exceeds a stale entries threshold, for example, when more than 20% of the fingerprint entries in a FPDS 1915 are stale. The verify trigger detector 1903 determines a current number of stale entries in a FPDS 1915, for example, by examining a stale entries datastore that is stored in a data store 1917 that is coupled to the stale fingerprint manager 1901. The verify trigger detector 1903 compares the current number of stale entries to a stale entries threshold. The stale entries threshold can be a user-defined threshold stored as a parameter in the data store 1917. In another example, a verify operation is triggered from a CLI. In another example, a verify operation is user-driven, for example, by the verify trigger detector 1903 receiving instructions entered by a user via a command line interface.
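The threshold comparison performed by the verify trigger detector can be expressed in a few lines. The function name and the percentage parameter are assumptions; the default of 20 mirrors the 20% example given above, and integer arithmetic is used to avoid floating-point rounding at the boundary.

```python
def should_trigger_verify(stale_count: int, total_count: int,
                          threshold_pct: int = 20) -> bool:
    """True when stale entries reach or exceed the stale entries threshold."""
    if total_count <= 0:
        return False  # an empty FPDS has nothing to verify
    return stale_count * 100 >= total_count * threshold_pct
```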

During a verify stage one, when the verify trigger detector 1903 detectsa trigger to execute a verify operation, the verify operation manager1905 executes a verify operation by invoking a data sorter 1907 to sortthe entries in the FPDS 1915. In one embodiment, the data sorter 1907sorts the entries in the FPDS 1915 using logical data and the contextinformation (e.g., by <file identifier, block offset in a file, timestamp>, such as <inode, fbn, cp-cnt> order). In another embodiment, thedata sorter 1907 sorts the entries in the FPDS 1915 using physical dataand the context information (e.g., by <vvbn, cp-cnt> order).

The stale entry identifier 1909 uses context information that is storedin the FPDS 1915 to identify stale fingerprint entries. Fingerprintentries having higher consistency point counter values are more recentthan entries with lower consistency point counter values. In oneembodiment, the stale entry identifier 1909 identifies fingerprintentries having the same <inode, fbn> as other entries, but with lowerconsistency point counter values compared to the other entries, as stalefingerprint entries. In another embodiment, the stale entry identifier1909 identifies fingerprint entries having the same <vvbn> as otherentries, but with lower consistency point counter values compared to theother entries, as stale fingerprint entries. The unidentified entriesare stale free fingerprint entries. In one embodiment, the stale entrymanager 1911 creates a stale entries datastore and stores it in the datastore 1917. One embodiment of a stale entry manager creating a staleentries datastore is described in conjunction with FIG. 15.

During verify stage two, the data sorter 1907 sorts the entries in the stale entries datastore. Subsequently, when a de-duplication process is invoked, the stale entries datastore can be used to remove the stale fingerprint entries from the FPDS 1915.

While a verify operation is executing, a de-duplication engine 1951 thatis coupled to the stale fingerprint manager 1901 monitors for ade-duplication operation start request (e.g., sis start command). In oneembodiment, when the de-duplication engine 1951 detects a de-duplicationstart request, it notifies the verify operation manager 1905 and returnsa success message to a user in response to the de-duplication startrequest. The de-duplication engine 1951 adds a message to a queue 1910,which is coupled to the de-duplication engine 1951, for a de-duplicationjob to be performed in response to the de-duplication start request. Thequeue 1910 can be a data store.

The verify operation manager 1905 receives the notification from thede-duplication engine 1951 and monitors for a checkpoint creation. Acheckpoint is a point in time during execution of a verify operation inwhich the verify operation manager 1905 can pause the verify operation.A checkpoint can be a user-defined point in time. A checkpoint creator1913 can be configured to create checkpoints according to a user-definedparameter that is stored in the data store 1917. In one embodiment, thecheckpoint creator 1913 creates a first checkpoint during verify stageone, for example, after the stale entry identifier 1909 identifies thestale fingerprint entries. The checkpoint creator 1913 can create morethan one checkpoint. For example, the checkpoint creator 1913 creates asecond checkpoint during verify stage two after the stale entriesdatastore is sorted.

When the verify operation manager 1905 detects that the checkpoint creator 1913 creates a checkpoint, the verify operation manager 1905 determines whether to suspend a verify operation that is currently executing. The verify operation manager 1905 examines the queue 1910, which is coupled to the verify operation manager 1905, to determine whether there are any pending de-duplication jobs to be performed and, if so, suspends the verify operation. In one embodiment, the verify operation manager 1905 marks the FPDS 1915 as read-only, stops the verify operation, and saves it in its current state to a storage device. The verify operation manager 1905 adds a message to the queue 1910 for the verify operation job to be resumed and notifies the de-duplication engine 1951 to invoke the de-duplication operation.

The de-duplication engine 1951 triggers operations of the other modules, such as the gatherer module 1953, fingerprint handler 1959, fingerprint manager 1955, and block sharing engine 1957 to execute a de-duplication operation for identifying and eliminating duplicate data blocks. Eliminating the duplicate data blocks includes sharing the remaining instance of each data block that was duplicated and freeing the (no longer used) duplicate data block(s). Embodiments of the modules executing a de-duplication operation are described in detail in conjunction with FIG. 7, FIG. 10, and FIG. 12. Returning to FIG. 19, the fingerprint manager 1955 merges the entries in the FPDS 1915 with entries in a changelog, and the block sharing engine 1957 identifies and eliminates the duplicate data blocks. In one embodiment, the fingerprint manager 1955 determines that the FPDS 1915 is marked as read-only and writes the merged data to a new FPDS to create a shadow copy of the FPDS 1915. Subsequently, when a verify operation is complete, the shadow copy can be merged with the original FPDS 1915. The de-duplication engine 1951 clears the message corresponding to the completed de-duplication operation from the queue 1910.

The de-duplication engine 1951 determines whether there is anotherde-duplication job in the queue 1910 to be performed. If not, thede-duplication engine 1951 notifies the verify operation manager 1905 toresume the suspended verify operation. The verify operation manager 1905receives the notification from the de-duplication engine 1951, marks theFPDS 1915 as read/write, and restarts the verify job from its savedstate from a storage device.
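The interplay between checkpoints, the queue, and suspension can be illustrated with a small single-threaded simulation. It is only a sketch under simplifying assumptions: a checkpoint is taken after every verify stage, de-duplication jobs are already queued when the checkpoint is reached, and the names run_verify_in_background, stages, and log are hypothetical.

```python
from collections import deque
from typing import Deque, List


def run_verify_in_background(stages: List[str],
                             dedupe_queue: Deque[str]) -> List[str]:
    """Run verify stages, yielding to queued de-duplication jobs at checkpoints.

    Returns a log of the actions taken, in order.
    """
    log: List[str] = []
    for stage in stages:
        log.append(f"verify:{stage}")
        # Checkpoint: the verify operation may be paused here.
        if dedupe_queue:
            log.append("suspend verify (FPDS read-only, state saved)")
            while dedupe_queue:
                job = dedupe_queue.popleft()
                log.append(f"dedupe:{job}")  # writes go to a shadow copy
            log.append("resume verify (FPDS read/write)")
    return log
```

In the described embodiments the de-duplication start request can arrive at any time and is acknowledged immediately; the simulation only shows that pending jobs are served at the next checkpoint before the verify operation continues.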

FIG. 20 is a flow diagram of a method 2000 for executing a verifyoperation (stale fingerprint record removal) as a background operation,according to certain embodiments. Method 2000 can be performed byprocessing logic that can comprise hardware (e.g., circuitry, dedicatedlogic, programmable logic, microcode, etc.), software (e.g.,instructions run on a processing device), or a combination thereof. Inone embodiment, method 2000 is performed by a de-duplication module(e.g., de-duplication module 390 in FIG. 3) hosted by storage servers110 of FIG. 1A.

The de-duplication module is coupled to storage (e.g., storage 170A,B inFIG. 1A) storing data blocks of data, and generates a fingerprint foreach data block in the storage. The de-duplication module is coupled toa fingerprints datastore (FPDS) that stores an entry (e.g., fingerprintrecord) for each data block that has been written to storage. In oneembodiment, the FPDS stores an entry (e.g., fingerprint record) for eachunique fingerprint. One embodiment of a FPDS that is divided intomultiple parts, such as a primary datastore and a secondary datastore,is described in greater detail in conjunction with FIGS. 9A-9B. Anotherembodiment of a FPDS that is organized into segments is described ingreater detail in conjunction with FIGS. 11A-11B.

A verify operation (stale entries removal operation) can be automatically triggered when the number of stale entries in a FPDS reaches or exceeds a stale entries threshold, for example, when more than 20% of the fingerprint entries in a FPDS are stale. The method 2000 determines a current number of stale entries in a FPDS, for example, by examining a stale entries datastore that is stored in a data store that is coupled to the de-duplication module. The method compares the current number of stale entries to a stale entries threshold. The stale entries threshold can be a user-defined threshold stored as a parameter in the data store. In another example, a verify operation is triggered from a CLI. In another example, a verify operation is user-driven, for example, by the method detecting instructions entered by a user via a command line interface.

At instruction block 2001, the method detects a trigger to execute averify operation and executes the verify operation at instruction block2003. While a verify operation is executing, the method monitors for ade-duplication operation start request (e.g., sis start command) atinstruction block 2005. The method determines whether the verifyoperation is finished at instruction block 2007. If the verify operationis not finished, the method determines whether a checkpoint is beingcreated during the verification operation at instruction block 2009. Acheckpoint is a point in time during execution of a verify operation inwhich the method can pause the verify operation. A checkpoint can be auser-defined point in time. In one embodiment, the method creates afirst checkpoint during verify stage one, for example, after the methodidentifies the stale fingerprint entries. The method can create morethan one checkpoint during a verify operation. For example, the methodcreates a second checkpoint during verify stage two after the methodidentifies the stale entries.

If a checkpoint is not created (instruction block 2009), the method determines whether there is a de-duplication operation start request (e.g., sis start command) at instruction block 2013. If there is not a de-duplication operation start request (e.g., sis start command), the method continues to execute the verify operation at instruction block 2003.

When a checkpoint is created (instruction block 2009), the method determines whether to suspend the verify job that is currently executing at instruction block 2011. The method examines the queue to determine whether there are any pending de-duplication jobs to be performed. If there are no pending de-duplication jobs, the method does not suspend the verify operation and determines whether there is a de-duplication start request at instruction block 2013.

When the method detects a de-duplication start request (e.g., sis start command), the method returns a success message to a user at instruction block 2015. A success message can include, for example, data indicating the de-duplication request is successfully received, data indicating the de-duplication is to be performed, etc. At instruction block 2017, the method adds a message to a queue, which is coupled to the method, for a de-duplication job to be performed. At instruction block 2019, while the verify operation continues to execute, the method monitors for when to create a checkpoint. If a checkpoint is not created (instruction block 2021), the verify operation continues to execute and the method continues to monitor for a checkpoint at instruction block 2019.

When a checkpoint is created (instruction block 2021), the method determines whether to suspend the verify job that is currently executing at instruction block 2011. If there is a de-duplication job that is pending in the queue (instruction block 2011), in one embodiment, the method marks the FPDS as read-only at instruction block 2023. In another embodiment, the method does not mark the FPDS as read-only and proceeds to stop the verify operation at instruction block 2025. For example, in some embodiments, a method does not overwrite a complete FPDS. For instance, a method only overwrites a segment of the FPDS, and subsequently, all such segments are merged to avoid write cost. Furthermore, in some embodiments, a verify operation runs on an ordered backup copy of a FPDS or on a primary copy of the FPDS (e.g., primary FPDS) which is ordered in a de-duplication friendly order.

At instruction block 2025, the method stops the verify operation and saves the verify job in its current state to a storage device at instruction block 2027. At instruction block 2029, the method adds a message to the queue for the verify operation job to be resumed and invokes the de-duplication operation at instruction block 2031.

At instruction block 2035, the method merges the entries in the FPDS with entries in a changelog, and identifies and eliminates the duplicate data blocks at instruction block 2037. In one embodiment, the method determines that the FPDS is marked as read-only at instruction block 2039, and writes the merged data to a new FPDS to create a shadow copy of the FPDS at instruction block 2041. Subsequently, when a verify operation is complete, the shadow copy can be merged with the original FPDS.

The method clears the message corresponding to the completed de-duplication operation from the queue. At instruction block 2045, the method determines whether there is another de-duplication job in the queue to be performed. If not, the method resumes the verify operation that is in the queue by restarting the saved verify job from a storage device at instruction block 2045. In one embodiment, the method marks the FPDS as read/write. Upon completion, the method clears the message corresponding to the completed verify operation from the queue.
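By way of illustration only, the suspend-and-resume flow described above might be sketched in Python as follows. The names `Scheduler`, `VerifyJob`, `on_checkpoint`, and `run_dedup_jobs`, as well as the string messages placed on the queue, are hypothetical and are not taken from the described embodiments. The sketch models the embodiment in which the FPDS is marked read-only while the verify job is suspended, and it stands in for the actual merge work with log messages.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VerifyJob:
    total_entries: int
    next_entry: int = 0  # position reached at the last checkpoint


@dataclass
class Scheduler:
    queue: deque = field(default_factory=deque)
    fpds_read_only: bool = False
    saved_verify: Optional[VerifyJob] = None
    log: list = field(default_factory=list)

    def request_dedup(self) -> str:
        """Queue a de-duplication job and acknowledge the request."""
        self.queue.append("dedup")
        return "success"

    def on_checkpoint(self, job: VerifyJob) -> bool:
        """Called at each verify checkpoint; returns True if verify was suspended."""
        if "dedup" not in self.queue:
            return False  # no pending de-duplication job: keep verifying
        self.fpds_read_only = True          # one embodiment: mark FPDS read-only
        self.saved_verify = job             # stands in for saving state to storage
        self.queue.append("verify-resume")  # message to resume verify later
        self.run_dedup_jobs()
        return True

    def run_dedup_jobs(self) -> None:
        """Run every queued de-duplication job, then resume the saved verify job."""
        while "dedup" in self.queue:
            self.queue.remove("dedup")
            if self.fpds_read_only:
                self.log.append("dedup: wrote shadow copy")
            else:
                self.log.append("dedup: overwrote FPDS")
        self.queue.remove("verify-resume")
        self.fpds_read_only = False         # FPDS is read/write again
        self.log.append(f"verify resumed at entry {self.saved_verify.next_entry}")
```

In this sketch a checkpoint is the only point at which the verify job can be suspended, which mirrors the description: the pending-job check happens when a checkpoint is created, not at arbitrary points in the verify operation.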

FIG. 21 is a flow diagram of a method 2100 for computing a fingerprint for a data block, according to certain embodiments. Method 2100 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device), or a combination thereof. In one embodiment, method 2100 is performed by a de-duplication module (e.g., de-duplication module 390 in FIG. 3) hosted by storage servers 110 of FIG. 1A.

At instruction block 2101, the method detects a request to write data to a data block. The flow diagram illustrates an embodiment of computing a fingerprint for a data block that operates concurrently with the operation of writing the blocks to storage (e.g., storage 170A,B in FIG. 1A). Instruction blocks 2103 and 2105 are performed concurrently with instruction block 2107. In alternative embodiments, computing the fingerprint and writing the fingerprint to the changelog file are not performed concurrently with writing the block to a storage device. At instruction block 2103, the fingerprint handler computes a fingerprint for the block. The fingerprint is passed to the fingerprint manager, which writes an entry (e.g., record) for the block in the changelog file at instruction block 2105 including, but not limited to, the fingerprint, the FBN (file block number), the inode number of the block, and other relevant context information that is specific to this block, such as the value of a consistency point counter at the time the block was written to a storage device.
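A minimal sketch of this write path, under stated assumptions, is given below in Python. The fingerprint function is not specified in this passage, so SHA-256 from the standard `hashlib` module is used purely as a stand-in. The names `ChangelogEntry`, `log_fingerprint`, `write_block`, and `handle_write`, and the in-memory `changelog` and `storage` containers, are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangelogEntry:
    fingerprint: str  # assumption: SHA-256 hex digest stands in for the fingerprint
    inode: int        # inode number of the file containing the block
    fbn: int          # file block number
    cp_count: int     # consistency point counter value at write time


changelog = []  # stand-in for the changelog file
storage = {}    # stand-in for the storage device, keyed by (inode, fbn)


def log_fingerprint(data: bytes, inode: int, fbn: int, cp_count: int) -> ChangelogEntry:
    """Compute the fingerprint and append a changelog entry (blocks 2103 and 2105)."""
    entry = ChangelogEntry(hashlib.sha256(data).hexdigest(), inode, fbn, cp_count)
    changelog.append(entry)
    return entry


def write_block(data: bytes, inode: int, fbn: int) -> None:
    """Write the block to storage (block 2107)."""
    storage[(inode, fbn)] = data


def handle_write(data: bytes, inode: int, fbn: int, cp_count: int) -> ChangelogEntry:
    """Run fingerprint logging concurrently with the block write."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fp_future = pool.submit(log_fingerprint, data, inode, fbn, cp_count)
        wr_future = pool.submit(write_block, data, inode, fbn)
        wr_future.result()  # propagate any write error
        return fp_future.result()
```

For the alternative embodiment in which the two paths are not concurrent, `handle_write` could simply call `write_block` and then `log_fingerprint` in sequence.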

FIG. 22 is a block diagram for sorting fingerprint entries in a fingerprints datastore and a changelog, according to certain embodiments. First, the fingerprints datastore 2201 is divided into some number, N, of approximately equal-sized chunks 2205. Each of the N chunks 2205 is then independently sorted by fingerprint value, using any conventional sorting algorithm, such as Quicksort, for example. The data sorting then compares the fingerprints in the entries of the same rank in all of the N chunks 2205 (e.g., the top entry in each of the N chunks 2205) and copies the entry which has the smallest fingerprint value from among those into the next available slot in the sorted output file 2202. This process is then repeated until all of the entries in the N sorted chunks 2205 have been copied into the output file 2202. The output file 2202 becomes the sorted fingerprints datastore 2203 when the sorting operation is complete.
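The chunk-then-merge sort of FIG. 22 can be approximated with the following Python sketch. It is an illustration under assumptions, not the described implementation: entries are modeled as hypothetical `(fingerprint, inode, fbn)` tuples held in memory, and the standard-library `heapq.merge` stands in for the repeated comparison of the top entry of each sorted chunk. The function name `sort_fingerprint_entries` is likewise hypothetical.

```python
import heapq


def sort_fingerprint_entries(entries, n_chunks):
    """Sort entries by fingerprint using N independently sorted chunks and a merge."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")

    def by_fingerprint(entry):
        return entry[0]

    # Divide the datastore into approximately equal-sized chunks (ceiling division).
    size = max(1, -(-len(entries) // n_chunks))
    chunks = [
        sorted(entries[i:i + size], key=by_fingerprint)  # any conventional sort
        for i in range(0, len(entries), size)
    ]

    # Repeatedly take the smallest head entry among the sorted chunks.
    return list(heapq.merge(*chunks, key=by_fingerprint))
```

Sorting each chunk independently keeps the working set of any single sort small, and the final merge reads each chunk sequentially, which is the usual motivation for this style of external sort.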

FIG. 23 is a flow diagram of a method 2300 for freeing a data block, such as a duplicate block, according to certain embodiments. Method 2300 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device), or a combination thereof. In one embodiment, method 2300 is performed by a de-duplication module (e.g., de-duplication module 390 in FIG. 3) hosted by storage servers 110 of FIG. 1A.

At instruction block 2301, the method determines whether the SHARED flag is set for the file which contains the block to be freed. If the SHARED flag is not set (meaning that no blocks in the file are shared), the process proceeds to instruction block 2307, in which the bit corresponding to the block is cleared in the active map. The process then ends. An active map is a bitmap of all data blocks managed by a storage server, i.e., one bit per data block. The bit for a given data block is set in the active map if the data block is allocated and cleared if the data block is free to be used. The active map is used during allocation of blocks to determine whether a block is free or not. The active map helps to improve performance by avoiding the need to read the reference count file to identify free blocks.

A reference count file is much larger (and therefore takes longer to read) than the active map. The reference count file contains an entry (e.g., record) for each data block maintained by the storage server, wherein each entry includes a value, REFCOUNT, indicating the number of references to that data block. In one embodiment, however, the active map and the reference count file could be combined into a single file to identify each free block as well as to indicate the number of references to the data block.

If the SHARED flag is set (instruction block 2301), then at instruction block 2303 the process decrements the REFCOUNT value for the block by one in the reference count file. After decrementing the REFCOUNT value, the process determines at instruction block 2305 whether the REFCOUNT value is zero. If the REFCOUNT value is zero (meaning that the block is no longer used), the process clears the corresponding bit in the active map and then ends. A data block that is freed can be reused. If the REFCOUNT value is determined to be non-zero (instruction block 2305), the process finishes.
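The decision logic of method 2300 is small enough to sketch directly. In the Python illustration below, the active map and the reference count file are modeled as in-memory dictionaries keyed by block number, which is an assumption made for brevity; the function name `free_block` and its parameters are hypothetical.

```python
def free_block(block, file_shared, active_map, refcount):
    """Release one reference to `block`; return True if the block became free.

    file_shared -- SHARED flag of the file that contains the block
    active_map  -- dict of block number -> bool (True means allocated)
    refcount    -- dict of block number -> REFCOUNT value
    """
    if not file_shared:
        # Block 2307: no blocks in the file are shared, so clear the bit directly
        # without reading the reference count.
        active_map[block] = False
        return True

    # Block 2303: decrement REFCOUNT by one.
    refcount[block] -= 1

    # Block 2305: the block is free only when no references remain.
    if refcount[block] == 0:
        active_map[block] = False
        return True
    return False
```

Note that the unshared path never touches the reference count structure, which reflects the stated purpose of the active map: avoiding reads of the larger reference count file when they are not needed.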

In certain embodiments, the system also maintains a change log to identify blocks that are new or modified since the last time a de-duplication operation was executed. The change log contains information of the same type as the fingerprints datastore (i.e., fingerprint of the block, inode number of the file to which the block belongs, and the FBN of the block), but only for new or modified blocks. From time to time, the system then re-executes the sorting process of FIG. 22 on both the fingerprints datastore and the change log, to merge the change log into the fingerprints datastore. In alternative embodiments, the system could simply from time to time scan the entire file system, compute the fingerprints of all data blocks, and eliminate duplicates at essentially the same time.
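One way to picture the merge of the change log into the fingerprints datastore is the Python sketch below. It is a simplified illustration under assumptions: entries are hypothetical `(fingerprint, inode, fbn)` tuples, the function name `merge_changelog` is invented for this example, and entries that share a fingerprint are treated only as duplicate candidates. Whether and how candidate blocks are compared before being shared is outside the scope of this sketch.

```python
import heapq
from itertools import groupby


def merge_changelog(fpds, changelog):
    """Merge change log entries into the datastore, grouping equal fingerprints.

    Returns (unique, duplicates): one retained entry per fingerprint, and the
    remaining entries whose fingerprint matches a retained entry.
    """
    def by_fingerprint(entry):
        return entry[0]

    # Sort both inputs by fingerprint, then merge the two sorted streams.
    merged = heapq.merge(
        sorted(fpds, key=by_fingerprint),
        sorted(changelog, key=by_fingerprint),
        key=by_fingerprint,
    )

    unique, duplicates = [], []
    for _, group in groupby(merged, key=by_fingerprint):
        first, *rest = group
        unique.append(first)     # existing datastore entries win ties
        duplicates.extend(rest)  # candidates for duplicate elimination
    return unique, duplicates
```

Because `heapq.merge` yields entries from the first input before equal-keyed entries from the second, an entry already present in the datastore is retained in preference to a newer change log entry with the same fingerprint.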

The particular methods of the de-duplication module have been described in terms of computer software with reference to a series of flow diagrams. FIGS. 4C, 8B, 9B, 11B, 14B, 16B, 17B, 18, 20, 21, and 23 relate to computer(s) in FIG. 2. The methods constitute computer programs made up of computer-executable instructions illustrated as blocks (acts) in FIGS. 1-23. Describing the methods by reference to a flow diagram enables one skilled in the art to develop such programs including such instructions to carry out the methods on suitably configured machines (the processor of the machine executing the instructions from computer-readable media, including memory). The computer-executable instructions may be written in a computer programming language or may be embodied in firmware logic. If written in a programming language conforming to a recognized standard, such instructions can be executed on a variety of hardware platforms and for interface to a variety of operating systems. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein. Furthermore, it is common in the art to speak of software, in one form or another (e.g., program, procedure, process, application, module, logic . . . ), as taking an action or causing a result. Such expressions are merely a shorthand way of saying that execution of the software by a computer causes the processor of the computer to perform an action or produce a result. It will be appreciated that more or fewer processes may be incorporated into the methods illustrated in FIGS. 4C, 8B, 9B, 11B, 14B, 16B, 17B, 18, 20, 21, and 23 without departing from the scope of the invention and that no particular order is implied by the arrangement of blocks shown and described herein.

What is claimed is:
 1. A method comprising: generating a fingerprint for each of a plurality of data blocks stored in a storage device; storing the fingerprint in a fingerprints datastore; identifying stale entries in the fingerprints datastore, the stale entries corresponding to duplicate data blocks eliminated from the storage device by a de-duplication operation; writing stale entry information for the stale entries to a stale entries datastore; and removing the stale entries in the fingerprints datastore using the stale entry information in response to a request for the de-duplication operation.
 2. The method of claim 1, wherein the fingerprints datastore includes a primary datastore and a secondary datastore, and wherein storing the fingerprint in the fingerprints datastore includes: determining whether the fingerprint corresponds to a duplicate data block of the duplicate data blocks; in response to determining that the fingerprint does not correspond to a duplicate data block of the duplicate data blocks, storing the fingerprint as a first entry in the primary datastore; and in response to determining that the fingerprint does correspond to a duplicate data block of the duplicate data blocks, storing the fingerprint as a second entry in the secondary datastore.
 3. The method of claim 2, wherein removing the stale entries in the fingerprints datastore comprises removing the stale entries from the secondary datastore.
 4. The method of claim 1, further comprising: overwriting the fingerprints datastore with stale-free entries; and executing the de-duplication operation using the fingerprints datastore having the stale-free entries.
 5. The method of claim 1, wherein removing the stale entries in the fingerprints datastore using the stale entry information comprises: identifying entries in the fingerprints datastore that correspond to the stale entries in the stale entries datastore as stale entries.
 6. The method of claim 1, wherein the stale entry information includes a stale entry for a data block of the plurality of data blocks, the stale entry including an entry index for the stale entry, the fingerprint of the data block, context information for the data block, an inode number of a file to which the data block belongs, and a file block number of the data block.
 7. The method of claim 1, further comprising: in response to determining that the request for the de-duplication operation is received during execution of a verify operation, suspending the verify operation, storing a current state of the verify operation to the storage device, and marking the fingerprints datastore as read-only.
 8. A storage apparatus comprising: a storage interface to communicate with a storage device; one or more processors communicably coupled to the storage interface; a memory to store instructions that, when executed by the one or more processors, cause the storage apparatus to: generate a fingerprint for each of a plurality of data blocks stored in the storage device; store the fingerprint in a fingerprints datastore; identify stale entries in the fingerprints datastore, the stale entries corresponding to duplicate data blocks eliminated from the storage device by a de-duplication operation; write stale entry information for the stale entries to a stale entries datastore; and remove the stale entries in the fingerprints datastore using the stale entry information in response to a request for the de-duplication operation.
 9. The storage apparatus of claim 8, wherein the fingerprints datastore includes a primary datastore and a secondary datastore, and wherein the instructions to cause the storage apparatus to store the fingerprint in the fingerprints datastore include instructions to cause the storage apparatus to: determine whether the fingerprint corresponds to a duplicate data block of the duplicate data blocks; in response to a determination that the fingerprint does not correspond to a duplicate data block of the duplicate data blocks, store the fingerprint as a first entry in the primary datastore; and in response to a determination that the fingerprint does correspond to a duplicate data block of the duplicate data blocks, store the fingerprint as a second entry in the secondary datastore.
 10. The storage apparatus of claim 8, wherein the instructions further include instructions to cause the storage apparatus to: overwrite the fingerprints datastore with stale-free entries; and execute the de-duplication operation using the fingerprints datastore having the stale-free entries.
 11. The storage apparatus of claim 8, wherein the instructions to cause the storage apparatus to remove the stale entries in the fingerprints datastore using the stale entry information include instructions to cause the storage apparatus to identify entries in the fingerprints datastore that correspond to the entries in the stale entries datastore as the stale entries in the fingerprints datastore.
 12. The storage apparatus of claim 8, wherein the stale entry information includes a stale entry for a data block of the plurality of data blocks, the stale entry including an entry index for the stale entry, the fingerprint of the data block, context information for the data block, an inode number of a file to which the data block belongs, and a file block number of the data block.
 13. The storage apparatus of claim 8, wherein the instructions further include instructions to cause the storage apparatus to: in response to a determination that the request for the de-duplication operation is received during execution of a verify operation, suspend the verify operation, store a current state of the verify operation to the storage device, and mark the fingerprints datastore as read-only.
 14. A non-transitory computer-readable medium having stored thereon instructions, that when executed by one or more processors, cause the one or more processors to: generate a fingerprint for each of a plurality of data blocks stored in a storage device; store the fingerprint in a fingerprints datastore; identify stale entries in the fingerprints datastore, the stale entries corresponding to duplicate data blocks eliminated from the storage device by a de-duplication operation; write stale entry information for the stale entries to a stale entries datastore; and remove the stale entries in the fingerprints datastore using the stale entry information in response to a request for the de-duplication operation.
 15. The non-transitory computer-readable medium of claim 14, wherein the fingerprints datastore includes a primary datastore and a secondary datastore, and wherein the instructions to cause the one or more processors to store the fingerprint in the fingerprints datastore include instructions to cause the one or more processors to: determine whether the fingerprint corresponds to a duplicate data block of the duplicate data blocks; in response to a determination that the fingerprint does not correspond to a duplicate data block of the duplicate data blocks, store the fingerprint as a first entry in the primary datastore; and in response to a determination that the fingerprint does correspond to a duplicate data block of the duplicate data blocks, store the fingerprint as a second entry in the secondary datastore.
 16. The non-transitory computer-readable medium of claim 15, wherein the instructions to cause the one or more processors to remove the stale entries in the fingerprints datastore include instructions to cause the one or more processors to remove the stale entries from the secondary datastore.
 17. The non-transitory computer-readable medium of claim 14, wherein the instructions further include instructions to cause the one or more processors to: overwrite the fingerprints datastore with stale-free entries; and execute the de-duplication operation using the fingerprints datastore having the stale-free entries.
 18. The non-transitory computer-readable medium of claim 14, wherein the instructions to cause the one or more processors to remove the stale entries in the fingerprints datastore using the stale entry information include instructions to cause the one or more processors to identify entries in the fingerprints datastore that correspond to the entries in the stale entries datastore as stale entries.
 19. The non-transitory computer-readable medium of claim 14, wherein the stale entry information includes a stale entry for a data block of the plurality of data blocks, the stale entry including an entry index for the stale entry, the fingerprint of the data block, context information for the data block, an inode number of a file to which the data block belongs, and a file block number of the data block.
 20. The non-transitory computer-readable medium of claim 14, wherein the instructions further include instructions to cause the one or more processors to: in response to a determination that the request for the de-duplication operation is received during execution of a verify operation, suspend the verify operation, store a current state of the verify operation to the storage device, and mark the fingerprints datastore as read-only.